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Preface 



This volume contains 40 contributed papers accepted for presentation at the 6^^ 
Annual European Symposium on Algorithms (ESA’98), Venice, Italy, August 
24-26, 1998, and the invited lectures by Eli Upfal (Brown) and Jeffrey Vitter 
(Duke). ESA is an annual meeting, with previous meetings held in Bad Honnef, 
Utrecht, Corfu, Barcelona, and Graz, and covers research in the use, design, 
and analysis of efficient algorithms and data structures as it is carried out in 
computer science, discrete applied mathematics and mathematical programming. 
Proceedings of ESA have been traditionally part of the LNCS series of Springer- 
Verlag. The symposium is promoted and coordinated by a steering committee, 
whose current composition is: 

Gianfranco Bilardi, Padua & UIC Kurt Mehlhorn, Saarbriicken 
Josep Diaz, Barcelona Ralf H. Mohring, Berlin 

Giuseppe F. Italiano, Venice Gerhard Woeginger, Graz 



The Program Committee of ESA’98 consisted of: 



Gianfranco Bilardi, Co-chair, 
Padua & UIC 
Jean Daniel Boissonnat, 

Sophia Antipolis 
Kieran T. Herley, Cork 
Dorit Hochbaum, Berkeley 
Giuseppe F. Italiano, Co-Chair, 
Venice 

Ludek Kucera, Prague 
Richard J. Lipton, Princeton 
Yishay Mansour, Tel-Aviv 



Kurt Mehlhorn, Saarbriicken 
Friedhelm Meyer auf der Heide, 
Paderborn 

Grammati Pantziou, Patras 
Mike Paterson, Warwick 
Jose Rolim, Geneva 
Erik M. Schmidt, Aarhus 
Maria Serna, Barcelona 
Paul Vitanyi, Amsterdam 
Gerhard Woeginger, Graz 



There were 131 extended abstracts submitted in response to a call for pa- 
pers. Papers were solicited describing original results in all areas of algorithmic 
research, including but not limited to: approximation algorithms; combinatorial 
optimization; computational biology; computational geometry; databases and 
information retrieval; graph and network algorithms; machine learning; number 
theory and computer algebra; on-line algorithms; pattern matching and data 
compression; symbolic computation. Submissions that reported on experimental 
and applied research were especially encouraged. 

The contributed papers were selected by the program committee, which met 
in Padova, Italy, on April 24 and 25, 1998. The selection of contributed papers 
was based on originality, quality, and relevance to the symposium. Gonsiderable 
effort was devoted to the evaluation of the submissions. Extensive feedback was 
provided to authors as a result, which we hope has proven helpful. However, 
submissions were not refereed in the thorough and detailed way that is customary 
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Preface 



for journal papers. Many submissions represent reports of continuing research. It 
is expected that most of the papers in these proceedings will eventually appear in 
finished form in scientific journals. We believe that the final program of ESA’98, 
organized into ten sessions, represents an interesting cross section of research 
efforts in the area of algorithms. 

We wish to thank the program committee, and all of those who submitted 
abstracts for consideration. We thank all of our many colleagues who helped 
the program committee in evaluating the submissions. Their names appear in a 
separate list, and we apologize to anyone whose name is omitted. ESA’98 could 
not have happened without the dedicated work of the organizing committee, 
consisting of: 

Francesco Bombi, Chair, Padua Andrea Pietracaprina, Padua 

Dora Giammarresi, Venice Fabio Pittarello, Venice 

Salvatore Orlando, Venice Geppino Pucci, Padua 

Marcello Pelillo, Venice Alessandro Roncato, Venice 

The agency Key Congress & Communications acted as a secretariat for the 
symposium. We gratefully acknowledge the generous support of the ESA’98 
sponsors: 



Consiglio Nazionale delle Ricerche 

InfoCamere 

IRT 

Telecom Italia 

Finally, Andrea Pietracaprina and 
editors of these proceedings. 



UNESCO 

Universita di Padova 
Universita Ca’ Foscari di Venezia 

Geppino Pucci have been invaluable co- 



Padua, June 1998 



Gianfranco Bilardi 



Venice, June 1998 



Giuseppe F. Italiano 
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Jeffrey Scott Vitter** 

Center for Geometric Computing, Department of Computer Science 
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j sv@cs . duke . edu 
http : //www . cs . duke . edu/~ jsv/ 



Abstract. Data sets in large applications are often too massive to fit 
completely inside the computer’s internal memory. The resulting in- 
put/output communication (or I/O) between fast internal memory and 
slower external memory (such as disks) can be a major performance bot- 
tleneck. In this tutorial, we survey the state of the art in the design 
and analysis of external memory algorithms (also known as EM algo- 
rithms or out-of-core algorithms or I/O algorithms). External memory 
algorithms are often designed using the parallel disk model (PDM). The 
three machine-independent measures of an algorithm’s performance in 
PDM are the number of I/O operations performed, the CPU time, and 
the amount of disk space used. PDM allows for multiple disks (or disk 
arrays) and parallel CPUs, and it can be generalized to handle cache 
hierarchies, hierarchical memory, and tertiary storage. 

We discuss a variety of problems and show how to solve them efficiently 
in external memory. Programming tools and environments are available 
for simplifying the programming task. Experiments on some newly de- 
veloped algorithms for spatial databases incorporating these paradigms, 
implemented using TPIE (Transparent Parallel I/O programming Envi- 
ronment), show significant speedups over popular methods. 



1 Introduction 

The Input/Output communication (or simply I/O) between the fast internal 
memory and the slow external memory (such as disk) can be a bottleneck in 
applications that process massive amounts of data [51]. One approach to op- 
timizing performance is to bypass the virtual memory system and to develop 
algorithms that explicitly manage data placement and movement, which we re- 
fer to as external memory algorithms, or more simply EM algorithms. The terms 
out- of- core algorithms and I/O algorithms are also sometimes used. 

* Extended abstract. An earlier version appeared as an invited tutorial in Pro- 
ceedings of the 17th Annual ACM Symposium on Prineiples of Database Sys- 
tems (PODS ’98), Seattle, WA, June 1998. A longer version appears in [106]. 
An updated version is available electronically on the author’s web pageat 
http://www.cys.duke.edu/~jsv/. Eeedback is welcome. 

** Supported in part by Army Research Office MURI grant DAAH04-96-1-0013 and 
by National Science Foundation research grant CCR-9522047. 



G. Bilardi et al. (Eds.): ESA’98, LNCS 1461, pp. 1—25, 1998. 
(c) Springer- Verlag Berlin Heidelberg 1998 
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In order to be effective, EM algorithm designers need to have a simple but 
reasonably accurate model of the memory system’s characteristics. Magnetic 
disks consist of one or more rotating platters and one read/ write head per platter 
surface. The data are stored in concentric circles on the platters called tracks. To 
read or write a data item at a certain address on disk, the read/ write head must 
mechanically seek to the correct track and then wait for the desired address to 
pass by. The seek time to move from one random track to another is often on 
the order of 5-10 milliseconds, and the average rotational latency, which is the 
time for half a revolution, has the same order of magnitude. In order to amortize 
this delay, it pays to transfer a large collection of contiguous data items, called 
a block. 

Even if an application can structure its pattern of memory use to take full 
advantage of disk block transfer, there is still a substantial access gap between 
internal memory and external memory performance. In fact the access gap is 
growing, since the speed of memory chips is increasing more quickly than disk 
bandwidth and disk latency. The use of parallel processors further widens the 
gap. Storage systems such as RAID are being developed that use multiple disks 
to get more bandwidth [29,59]. 

We can capture the main properties of magnetic disks and multiple disk 
systems by the commonly-used parallel disk model (PDM) introduced by Vitter 
and Shriver [107]: 

N = problem size (in units of data items), 

M = internal memory size (in units of data items) , 

B = block transfer size (in units of data items) , 

D = ^ independent disk drives, 

P = 4 CPUs, 

where M < N, and 1 < DB < M/2. In a single I/O, each of the D disks can 
simultaneously transfer a block of B contiguous data items, li P < D, each of 
the P processors can drive about D/P disks; if D < P, each disk is shared by 
about P/D processors. The internal memory size is M/P per processor. The P 
processors are connected by an interconnection network. 

It is often helpful to refer to some of the above PDM parameters in units of 
blocks rather than in units of data items. We define the lower-case notation 




( 1 ) 



to be the problem size and internal memory size, respectively, in units of disk 
blocks. We assume that the input data are initially “striped” across the D disks, 
in units of blocks, as illustrated in Figure 1, and we require the output data to 
be similarly striped. Striped format allows a file of N data items to be read or 
written in 0(n/D) = 0{N/DB) I/Os, which is optimal. We refer to 0{n/D) 
I/Os as a “linear number of I/Os” in the PDM model. 

The primary measures of performance in PDM are the number of I/O oper- 
ations performed, the internal (parallel) computation time, and the amount of 
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Fig. 1. Initial data layout on the disks, D = 5, B = 2. The input data items are 
initially striped block-by-block across the disks. For example, data items 16 and 17 are 
stored in the second block (i.e., in stripe 1) of disk T> 3 . 



disk space used. For reasons of brevity we will focus in this paper on the number 
of I/Os performed. Most of the algorithms discussed in the paper use optimal 
CPU time, at least for the single-processor case, and 0{n) blocks of disk space, 
which is optimal. 

The track size is a parameter of the disk hardware and cannot be altered;^ 
for most disks it is in the range 50-100 KB. Generally the block transfer size B 
in PDM is chosen to be a significant fraction of the track size so as to amortize 
seek time.^ 

PDM is a good generic programming model that facilitates the design of I/O- 
efficient algorithms, especially when used in conjunction with the programming 
tools discussed in Section 10. More complex and precise disk models have been 
developed, such as the ones by Ruemmler and Wilkes [86], Shriver et al. [91], and 
Barve et al. [20], which distinguish between sequential reads and random reads 
and consider the effects on throughput of features such as disk buffer caches 
and shared buses. In practice, the effects of more complex models can often be 
realized or approximated by PDM with an appropriate choice of parameters. The 
bottom line is that programs that perform well in terms of PDM will generally 
perform well when implemented on real systems. 

The study of the complexity of algorithms using external memory devices 
began more than 40 years ago with Demuth’s Ph.D. thesis on sorting [41, 69]. In 

^ The actual track size for any given disk varies with the radius of the track. Sets 
of adjacent tracks are usually formatted to have the same track size, so there are 
typically a small number of different track sizes for a given disk. 

^ The minimum block transfer size imposed by hardware is usually 512 bytes, but 
operating systems generally use a much larger block size, such as 8 KB. It is possible 
to use an even larger block size and further reduce the relative significance of seek 
and rotational latency, but the wall clock time per I/O will increase accordingly. 
For example, if we set B to be five times larger than the track size, the time per 
I/O will correspond to five revolutions of the disk plus the relatively insignificant 
seek time. (If the disk is smart enough, rotational latency can be avoided, since 
the block spans entire tracks.) Once the block transfer size becomes larger than 
the track size, the wall clock time per I/O grows linearly with the block size. It is 
thus often helpful to think of the block transfer size in PDM as a fixed hardware 
parameter roughly proportional to the track size. The particular block size that 
optimizes performance in an actual implementation will typically vary slightly from 
application to application. 
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the early 1970s, Knuth [69] did an extensive study of sorting using magnetic tapes 
and (to a lesser extent) magnetic disks. At about the same time, Floyd [48, 69] 
considered a disk model akin to PDM for D = 1, P = 1, B = M/2 = for 

constant c > 0, and developed optimal upper and lower I/O bounds for sorting 
and matrix transposition. Savage and Vitter [90] developed an I/O version of 
pebbling for straightline computations, in which the data items are transferred 
in units of blocks. Aggarwal and Vitter [7] generalized Floyd’s I/O model to 
allow simultaneous block transfers, but the model was unrealistic for a single 
disk. They developed matching upper and lower I/O bounds for all parameter 
values for a host of problems. Since the PDM model can be thought of as a 
more restrictive (and more realistic) version of their model, their lower bounds 
apply as well to PDM. Modified versions of PDM that integrate various aspects 
of parallel computation are developed in [75,40]. Surveys of various aspects of 
I/O models and algorithms appear in [92, 11]. 

The same type of bottleneck that occurs between internal memory and ex- 
ternal disk storage can also occur at other levels of the memory hierarchy, such 
as between data cache and level 2 cache, or between level 2 cache and DRAM, or 
between disk storage and tertiary devices. The PDM model can be generalized 
to multilevel memory hierarchies, but for reasons of brevity and emphasis, we 
do not discuss such models here. We refer the reader to [108] and its references. 

In this paper we survey several useful paradigms for solving problems effi- 
ciently in external memory. In Sections 2-9, we discuss distribution and merging 
(for sorting-related problems), distribution sweep (for spatial join and finding all 
nearest neighbors), batched filtering and external fractional cascading (for GIS 
map overlay and red- blue line segment intersection), randomized incremental 
constructions (for intersection and other geometry problems), B-trees (for dic- 
tionaries and one-dimensional range queries), buffer trees (for batched dynamic 
problems and sweep line applications), interval trees (for stabbing queries), and 
R-trees (for spatial applications such as multidimensional range queries and con- 
tour line processing in GIS). 

In Section 10, we describe experiments on two problems arising in spatial 
databases, for which speedups are possible over currently used techniques. We 
use TPIE (Transparent Parallel I/O programming Environment) for our imple- 
mentations. In Section II we describe EM algorithms that can adapt optimally 
to dynamically changing memory allocations. 

2 External Sorting and Related Problems 

The problem of sorting is a central problem in the field of external memory 
algorithms, partly because sorting and sorting-like operations account for a sig- 
nificant percentage of computer use [69] , and also because sorting is an important 
paradigm in the design of efficient EM algorithms. With some technical qual- 
ifications, many problems that can be solved easily in linear time in internal 
memory, such as list ranking, permuting, expression tree evaluation, and con- 
nected components in a sparse graph, require the same number of I/Os in PDM 
as sorting. 
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Theorem 1 ([7, 82]). The average-case and worst-case number of I/Os required 
for sorting N data items using D disks is 




It is conceptually much simpler to program for the single-disk case (D = 1) 
than for the multi-disk case. Disk striping is a paradigm that can ease the pro- 
gramming task with multiple disks. I/Os are permitted only on entire stripes, 
one at a time. For example, in the data layout in Figure 1, data items 20-29 can 
be accessed in a single I/O step because their blocks are grouped into the same 
stripe. The net effect of striping is that the D disks behave as a single logical 
disk, but with a larger logical block size DB. 

Let us consider what happens if we use the technique of disk striping in 
conjunction with an optimal sorting algorithm for one disk. The optimal number 
of I/Os using one disk is 



/ logn \ ^ (N \og{N/B) \ 
\ logm/ " \B\og{M/B)) 



(3) 



The effect of disk striping with D disks is to replace B by DB in (3), which 
yields the I/O bound 



/ N \og{N/DB) \ ^ / n log(n/D) \ 

\DB\og{M/DB) ) ~ ~ \D\og{m/D) ) ' 



The striping I/O bound (4) can be larger than the optimal bound (2) by a multi- 
plicative factor of (logm)/ log(m/D), which is significant when D is on the order 
of m, causing the \og{m/ D) term in the denominator to be very small. In order 
to attain the optimal sorting bound (2) theoretically, we must be able to control 
the disks independently, so that each disk can access a different stripe in the same 
I/O. Sorting via disk striping is often more efficient in practice than more com- 
plicated techniques that utilize independent disks, since the (logm)/log(m/Z?) 
factor may be dwarfed by the additional overhead of using the disks indepen- 
dently [104]. 

In the following two subsections we discuss and analyze new external sort- 
ing algorithms based upon the distribution and merge paradigms. The SRM 
method, which uses a randomized merge technique, outperforms disk striping in 
practice for reasonable values of D (see Section 2.2). In the last two subsections 
we consider the related problems of permuting and Fast Fourier Transform. All 
the methods we discuss, with the exception of Greed Sort in Section 2.2, use the 
disks independently for parallel read operations, but parallel writes are done in 
a striped manner, which facilitates the writing of parity error correction infor- 
mation [29, 59]. The lower bounds are discussed in Section 3. 



2.1 Sorting by Distribution: Simultaneous Online Load Balancings 

Distribution sort is a recursive process in which the data items to be sorted are 
partitioned by a set of S' — 1 partitioning elements into S buckets. All the items 
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in one bucket precede all the items in the next bucket. The individual buckets 
are sorted recursively and then concatenated together to form a single totally 
sorted list. 

The S' — 1 partitioning elements are chosen in such a way that the buckets 
are of roughly equal size. Hence, the bucket sizes decrease by a 0{S) factor from 
one level of recursion to the next, and there are 0(log_g n) levels of recursion. 
During each level of recursion, the data are streamed through internal memory, 
and the S buckets are written to the disks in an online manner. A double buffer 
of size 2B is allocated to each of the S buckets. When one half of the double 
buffer fills, its block is written to disk in the next I/O, and the other half is used 
to store the incoming items. Therefore, the maximum number of buckets (and 
partitioning elements) is S' = 0{M/B) = 0(m), and the resulting number of 
levels of recursion is 6>(log^n). 

It seems difficult to find S = 0{m) partitioning elements using 0{n/D) I/Os 
and guarantee that the bucket sizes are within a constant factor of one another. 
But efficient deterministic methods exist for choosing S = ySn partitioning ele- 
ments [107, 81] , which has the effect of doubling the number of levels of recursion. 
Probabilistic methods based upon random sampling can be found in [45] . 

In order to meet the sorting bound (2), the formation of the buckets at 
each level of recursion must be done in 0{njD) I/Os, which is easy to do for 
the single-disk case. In the more general multiple-disk case, each read step and 
each write step during the bucket formation must involve on the average 0{D) 
blocks. The file of items being partitioned was itself one of the buckets formed 
in the previous level of recursion. In order to read that file efficiently, its blocks 
must be spread uniformly among the disks, so that no one disk is a bottleneck. 
The challenge in distribution sort is to write the blocks of the buckets to the 
disks in an online manner and achieve a global load balance by the end of the 
partitioning, so that the bucket can be read efficiently during the next level of 
the recursion. 

Partial striping is a useful technique that can reduce the amount of informa- 
tion that must be stored in internal memory in order to manage the disks. The 
disks are grouped into clusters of size C and data are written in “logical blocks” 
of size CB, one per cluster. Choosing C = \/D won’t change the optimal sorting 
time by more than a constant factor, but as pointed out earlier, full striping (in 
which C = D) can be nonoptimal. 

In addition to partial striping, Vitter and Shriver [107] use two randomized 
online techniques during the partitioning so that with high probability each 
bucket is well balanced across the D disks. (Partial striping is used so that the 
pointers needed to keep track of the layout of the buckets on the disks can fit 
in internal memory.) The first technique is used when the size N of the file to 
partition is sufficiently large or when DB ^ M . Each parallel write operation 
writes its D blocks in random order to a stripe, with all D\ orders equally likely. If 
N is not large enough, however, the technique breaks down and the distribution 
of each bucket tends to be uneven. For these smaller values of N , Vitter and 
Shriver use a different technique: In one pass, the file is read, one memoryload 
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at a time. Each memory load is randomly permuted and written back to the 
disks in the new order. In a second pass, the file is accessed in a “diagonally 
striped” manner. They show that with very high probability each individual 
“diagonal stripe” contributes about the same number of items to each bucket, 
so an oblivious shuffling pattern can achieve the desired global balance. 

DeWitt et al. [42] present a randomized distribution sort algorithm in a 
similar model to handle the case when sorting can be done in two passes. They 
use a sampling technique to find the partitioning elements and route the items 
in each bucket to a particular processor. The buckets are sorted individually in 
the second pass. 

An even better way to do distribution sort, and deterministically at that, 
is the BalanceSort method developed by Nodine and Vitter [81]. During the 
partitioning process, the algorithm tracks how evenly each bucket has been dis- 
tributed so far among the disks. For each I < b < S and 1 < d < D, let numb 
be the total number of items in bucket b processed so far during the partition- 
ing and let numb{d) be the number of those items written to disk d; that is, 
numb = J2i<d<D'^'^''^b{d). The algorithm is able to write at least half of any 
given memoryload to the disks and still maintain the invariant for each bucket b 
that the [D/2\ largest values of numb{l), numb{2), ..., numb{D) differ by at 
most 1, and hence each numb{d) is at most about twice the ideal value numb/ D. 

2.2 Sorting by Merging 

The merge paradigm is somewhat orthogonal to the distribution paradigm dis- 
cussed above. A typical merge sort algorithm works as follows: In the “run” 
formation phase, the n blocks of data are streamed into memory, one memory- 
load at a time; each memoryload is sorted into a “run,” which is then output to 
stripes on disk. At the end of the run formation phase, there are N/M = n/m 
(sorted) runs, each striped across the disks. (In actual implementations, the 
“replacement-selection” technique can be used to get runs of 2M data items, on 
the average, when M B [69].) 

After the initial runs are formed, the merging phase begins. In each pass of 
the merging phase, groups of R runs are merged together. During each merge, 
one block from each run resides in internal memory. When the data items of a 
block expire, the next block for that run is input. Double buffering is used to 
keep the disks busy. Hence, at most R = 0(m) runs can be merged at a time; 
the resulting number of passes is O(log^n). 

To achieve the optimal sorting bound (2), each merging pass must be done 
in 0{n/D) I/Os, which is easy to do for the single-disk case. In the more general 
multiple-disk case, each parallel read operation during the merging must on the 
average bring in the next 0{D) blocks needed for the merging. The challenge 
is to ensure that those blocks reside on different disks so that they can be read 
in a single I/O (or a small constant number of I/Os). The difficulty lies in the 
fact that the runs being merged were themselves formed during the previous 
merge pass. Their blocks were written to the disks in the previous pass without 
knowledge of how they would interact with other runs in later merges. 
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A perfect solution, in which the next D blocks needed for the merge are 
guaranteed to be on distinct disks, can be devised for the binary merging case 
R = 2 based on the Gilbreath principle [50,69]: The first run is striped in 
ascending order by disk number, and the other run is striped in descending 
order. The next D blocks needed for the output can always be read in by a 
single I/O operation, and thus the amount of buffer space needed for binary 
merging can be kept to a minimum. Unfortunately there is no analog to the 
Gilbreath principle for R > 2, and as we have seen above it is necessary to use 
a large value of R in order to get an optimal sorting algorithm. 

The Greed Sort method of Nodine and Vitter [82] was the first optimal deter- 
ministic EM algorithm for sorting with multiple disks. It handles the case R> 2 
by relaxing the condition on the merging process. In each step, the following two 
blocks from each disk are brought into internal memory: the block bi with the 
smallest data item value and the block 62 whose largest item value is smallest. 
It may be that 61 = &2, in which case only one block is read into memory, and 
it is added to the next output stripe. Otherwise, the two blocks bi and 62 are 
merged in memory; the smaller B items are written to the output stripe, and 
the remaining items are prepended back to the run originally containing bi . The 
resulting run that is produced is only an “approximately” merged run, but its 
saving grace is that no two inverted items are too far apart. A final application 
of Golumnsort [73] in conjunction with partial striping restores total order. 

An optimal deterministic merge sort, with somewhat higher constant factors 
than those of the distribution sort algorithms, was developed by Aggarwal and 
Plaxton [6] , based upon the Sharesort hypercube sorting algorithm [39] . To guar- 
antee even distribution during the merging, it employs two high-level merging 
schemes in which the scheduling is almost oblivious. 

The most practical method for sorting is the simple randomized merge sort 
(SRM) algorithm of Barve et al. [21] (referred to as “randomized striping” by 
Knuth [69]). Each run is striped across the disks, but with a random starting 
point (the only place in the algorithm where randomness is used.) During the 
merging process, the next block needed from each disk is read into memory, 
and if there is not enough room, the least needed blocks are “flushed” (without 
any I/Os required) to free up space. The expected performance of SRM is not 
optimal for all parameter values, but it significantly outperforms the use of disk 
striping for reasonable values of the parameters, as shown in Table 1. Barve et 
al. [21] derive an upper bound on the I/O performance; the precise analysis is 
an interesting open problem. 

2.3 Permuting and Transposition 

Permuting is the special case of sorting in which the key values of the N data 
items form a permutation of {1, 2, ... , N}. 

Theorem 2 ([7]). The average-case and worst-case number of I/Os required 
for permuting N data items using D disks is 

6* (min{^,^log„n}) . (5) 




External Memory Algorithms 



9 





D = 5 


o 

II 


D = 50 


11 


0.56 


0.47 


0.37 


fc = 10 


0.61 


0.52 


0.40 


fc = 50 


0.71 


0.63 


0.51 



Table 1. The ratio of the number of I/Os used by simple randomized merge sort 
(SRM) to the number of I/Os used by merge sort with disk striping, during a merge of 
kD runs. The figures were obtained by simulation; they back up the (more pessimistic) 
analytic upper bound in [21]. 



Matrix transposition is the special case of permuting in which the permuta- 
tion can be represented as a transposition of a matrix from row-major order into 
column-major order. 

Theorem 3 ([7]). The number of I/Os required using D disks to trans- 
pose a p X q matrix from row-major order to column-major order is 
0{{n/D) log„ min{M,p, q, n}). 

Matrix transposition is a special case of a more general class of permutations 
called bit-permute/ complement (BPC) permutations, which in turn is a subset of 
the class of bit-matrix-multiply /complement (BMMC) permutations. BPC per- 
mutations include matrix transposition, bit-reversal permutations (which arise 
in the FFT), vector-reversal permutations, hypercube permutations, and matrix 
reblocking. Cormen et al. [37] characterize the optimal number of I/Os needed 
to perform any given BMMC permutation in terms of its matrix representation, 
and they give an optimal algorithm for implementing it. 

2.4 Fast Fourier Transform 

Computing the Fast Fourier Transform (FFT) in external memory consists of a 
series of I/Os that permit each computation implied by the FFT (or butterfly) 
directed graph to be done while its arguments are in internal memory. A permu- 
tation network computation consists of a fixed pattern of I/Os such that any of 
the A^! possible permutations can be realized; data items can only be reordered 
when they are in internal memory. A permutation network can be realized by a 
series of three FFTs [111]. 

Theorem 4. The number of I/Os using D disks required for computing the N- 
input FFT digraph or an N-input permutation network is given by the same 
bound (2) as for sorting. 

Cormen and Nicol [36] give some practical implementations for one- 
dimensional FFTs based upon the optimal PDM algorithm of [107]. The al- 
gorithms for FFT are simpler than for sorting because the computation is non- 
adaptive in nature, and thus the communication pattern is oblivious. 





10 



J.S. Vitter 



3 Lower Bounds on I/O 

In this section we prove the lower bounds from Theorems 1-4 in Section 2. The 
most trivial batched problem is that of scanning (or streaming or touching) a 
file of N data items, which can be done in a linear number 0{N/DB) = 0{n/D) 
of I/Os. Permuting is a simple problem that can be done in linear time in the 
(internal memory) RAM model, but requires a nonlinear number of I/Os in PDM 
because of the locality constraints imposed by the block parameter B. 

The following proof of the permutation lower bound (5) of Theorem 2 is due 
to Aggarwal and Vitter [7]. The idea of the proof is to measure, for each t > 0, 
the number of distinct orderings that are realizable by at least one sequence of 
t I/Os. The value of t for which the number of distinct orderings first exceeds 
A^!/2 is a lower bound on the average number of I/Os (and hence the worst-case 
number of I/Os) needed for permuting. 

We assume for the moment that there is only one disk, D = 1. Let us consider 
how the number of realizable orderings can change when we read a given disk 
block into internal memory. There are at most B data items in the block, and 
they can intersperse among the M items in internal memory in at most (^) ways, 
so the number of realizable orderings increases by a factor of (^) . If the block 
has never before resided in internal memory, the number of realizable orderings 
increases by an extra B\ factor, since the items in the block can be permuted 
among themselves. (This extra contribution of B\ can only happen once for each 
of the N/B original blocks.) The effect of writing the disk block is considerably 
less than that of reading it. There are at most n + t < TV log W ways to choose 
which disk block is involved in the I/O. Hence, the number of distinct orderings 
that can be realized by some sequence of t I/Os is at most 

(B!)A'/b (^7V(logfV)(^^^^ . (6) 

Setting the expression in (6) to be at least N\/2, and simplifying by taking the 
logarithm, we get 

N log B + t 

We get the lower bound for the case D = 1 by solving for t. The general lower 
bound (5) follows by dividing by D. 

Permuting is a special case of sorting, and hence, the permuting lower bound 
applies also to sorting. In the unlikely case that Blogm = o(logn), the permut- 
ing bound is only fi{N / D), and we must resort to the comparison model to get 
the full lower bound (2) of Theorem 1 [7]. Arge et al. [14] show for the com- 
parison model that any problem with an f2(N log N) lower bound in the RAM 
model requires f?(nlog^ n) I/Os in PDM. However, in the typical case in which 
Blogm = 17 (log n), the comparison model is not needed to prove the sorting 
lower bound; the difficulty of sorting in that case arises not from determining 
the order of the data but from permuting (or routing) the data. 



logN + Blog^ ) = n{NlogN). 
B 



(7) 
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The same proof used above for permuting also applies to permutation net- 
works, in which the communication pattern is oblivious. Since the choice of disk 
block is fixed for each t, there is no NlogN term as there is in (6) (and corre- 
spondingly there is no additive logA'^ term as there is in (7)). Hence, when we 
solve for t, we get the lower bound (2) rather than (5). In contrast with sort- 
ing, the lower bound follows directly from the counting argument, without any 
dependence upon the comparison model for the case B log m = o(log n) . The 
same lower bound also applies to the FFT, since permutation networks can be 
built from a series of three FFTs. The lower bound for transposition involves a 
potential argument based on a togetherness relation [7]. 

Chiang et al. [30] and Arge [10] discuss lower bound reductions for several 
graph problems. Problems like list ranking and expression tree evaluation have 
the same nonlinear PDM lower bound as permuting. This situation is in contrast 
with the RAM model, in which the same problems can all be done in linear time. 

The lower bounds mentioned above assume that the data items are in some 
sense “indivisible” in that they are not split up and reassembled in some magic 
way to get the desired output. It is conjectured that the sorting lower bound (2) 
remains valid even if the indivisibility assumption is lifted. However, for an ar- 
tificial problem related to transposition, Adler [2] showed that removing the 
indivisibility assumption can lead to faster algorithms. A similar result is shown 
by Arge and Miltersen [15] for the problem of determining if the N data item 
values are distinct. 

4 Matrix and Grid Computations 

Dense matrices are generally represented in memory in row-major or column- 
major order. Matrix transposition, which is the conversion of a matrix from one 
representation to the other, was discussed in Section 2.3. For certain operations 
such as matrix addition, both representations work well. However, for standard 
matrix multiplication and LU decomposition, neither representation is appro- 
priate; a better representation is to block the matrix into square \^B x 
submatrices. 



Theorem 5 ([60, 90, 107, 110]). The number of I/Os required to multiply 
two k X k matrices or to compute the LU factorization of a k x k matrix is 



0(///mln{k,^/M}DB). 



Hong and Kung [60] and Nodine et al. [80] give optimal EM algorithms 
for iterative grid computations, and Leiserson et al. [74] show how to reduce 
the number of I/Os of naive multigrid implementations by a factor. 

Gupta et al. [56] show how to derive efficient EM algorithms automatically for 
computations expressed in tensor form. Different techniques are called for when 
the matrices are sparse. The reader is referred to [35,76,87,97,96,104,106] for 
further study of EM matrix algorithms. 




12 



J.S. Vitter 



5 Batched Problems in Computational Geometry 

Problems involving massive amounts of geometric data are ubiquitous in spa- 
tial databases [72,88,89], geographic information systems (GIS) [72,88,100], 
constraint logic programming [65, 66], object-oriented databases [112], statistics, 
virtual reality systems, and computer graphics [88]. NASA’s Earth Observing 
System, for example, produces petabytes (10^® bytes) of raster data per year. A 
major challenge is to develop mechanisms for processing the data, or else much 
of it will be wasted. 

For systems of this size to be efficient, we need efficient EM algorithms and 
data structures for basic problems in computational geometry. Luckily, many 
problems on geometric objects can be reduced to a small number of base prob- 
lems, such as computing intersections, convex hulls, or nearest neighbors. Useful 
paradigms have been developed for solving these problems. 

For brevity in the remainder of this paper, we deal only with the single-disk 
case D = 1. The I/O bounds for the batched problems can often be cut by a 
factor of 0{D) using the techniques of Section 2. 

Theorem 6. The following hatched problems and several related problems in- 
volving N items and Q queries can he solved using 

0{{n + q)log^n + z) (8) 

I/Os, where q — \Q/Bd\ (or else q — 0 if Q is not defined) and z = \ZIB\ for 
output size Z: 

1. Answering Q orthogonal range queries on N points, 

2. Finding all intersections between N nonintersecting red line segments and 
N nonintersecting blue line segments. 

3. Computing the pairwise intersections of N segments, 

4- Constructing the 2-D and 3-D convex hull of N points, 

5. Triangulation of N points, 

6. Performing Q point location queries in a planar subdivision of size N, 

7. Finding all nearest neighbors for a set of N points, 

Goodrich et al. [52], Arge et al. [18], Arge et al. [17], and Grauser et al. [38] 
develop EM algorithms for these problems using the following EM paradigms 
for batched problems: 

Distribution sweeping: a generalization of the distribution paradigm of Section 2 
for externalizing plane sweep algorithms; 

Persistent B-trees: an offline method for constructing an optimal-space persis- 
tent version of the B-tree data structure (see Section 7), yielding a factor of B 
improvement over the generic persistence techniques of Driscoll et al. [43]. 

Batched filtering: a general method for performing simultaneous external mem- 
ory searches in data structures that can be modeled as planar layered di- 
rected acyclic graphs and in external fractionally cascaded data structures; 
useful for 3-D convex hulls and batched point location. 

External fractional cascading: an EM analog to fractional cascading. 




External Memory Algorithms 



13 



Online filtering: a technique based upon the work of Tamassia and Vitter [95] 
for online queries in data structures with fractional cascading. 

External marriage-before- conquest: an EM analog to the technique of Kirk- 
patrick and Seidel [68] for output-sensitive convex hull constructions. 

Randomized incremental construction with gradations, a localized version of the 
incremental construction paradigm of Clarkson and Shor [34] . 

The distribution sweep paradigm is fundamental to sweep line processes. For 
example, we can compute the pairwise intersections of N orthogonal segments in 
the plane by the following recursive distribution sweep: At each level of recursion, 
the plane is partitioned into 0{m) vertical strips, each containing &{N/m) of the 
segments’ endpoints. We use a horizontal sweep line to process the N segments 
from top to bottom. When a vertical segment is encountered by the sweep line, 
the segment is inserted into the appropriate strip. When a horizontal segment h is 
encountered, we report /I’s intersections with all the “active” vertical segments in 
the strips that are spanned completely hy h. (A vertical segment is “active” if it 
is intersected by the current sweep line; vertical segments that are found to be no 
longer active are deleted from the strips.) The remaining end portions of h (which 
partially span a strip) are passed recursively to the next level, along with the 
vertical segments. After the initial sorting preprocessing, each of the 0(log„ n) 
levels of recursion requires 0{n) I/Os, yielding the desired bound (8). Arge et 
al. [17] develop a unified approach to distribution sweep in higher dimensions. 

Crauser et al. [38] use an incremental randomized construction to attain the 
I/O bound (8) for several other geometry problems. 

6 Batched Problems on Graphs 

The first work on EM graph algorithms was by Ullman and Yannakakis [98] 
for the problem of transitive closure. Chiang et al. [30] consider a variety of 
graph problems, several of which have upper and lower I/O bounds related to 
permuting. One key idea Chiang et al. exploit is that efficient EM algorithms 
can often be developed by a sequential simulation of a parallel algorithm for the 
same problem. Sorting is done periodically to reblock the data. In list ranking, 
which is used as a subroutine in the solution of several other graph problems, the 
number of working processors in the parallel algorithm decreases geometrically 
with time, so the number of I/Os for the entire simulation is proportional to 
the number of I/Os used in the first phase, which is given by the sorting bound 
&{^{n/D)log^n). Dehne et al. [40] and Sibeyn and Kaufmann [94] show how 
to achieve the same I/O bound if logB = 0(log{M/B)') by exploiting coarse- 
grained parallel algorithms whose data access characteristics permit the periodic 
sortings to be done with a linear number of I/Os. 

For list ranking, the optimality of the EM algorithm in [30] assumes that 
^/mlogm = f7(logn), which is usually true. That assumption can be removed by 
use of the buffer tree data structure [9] (see Section 7.1). A practical, randomized 
implementation of list ranking appears in [93] . Recent work on other EM graph 
algorithms appears in [1, 10, 71, 54] . The problem of how to store graphs on disks 
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for efficient traversal is discussed in [79,4]. Constructing classification trees in 
external memory for data mining is discussed in [109]. 

The I/O complexities of several of the basic graph problems con- 
sidered in [30,98] remain open, including connectivity, topological sort- 
ing, shortest paths, breadth-first search, and depth-first search. For exam- 
ple, for a graph with V vertices and E edges, the best-known EM algo- 
rithms for depth- first search and transitive closure require + V) and 

0{V^ ^JE/MJ B) I/Os, respectively. Connected components can be determined 
in 0(min{-^ log„ (log log„^ ■§}) I/Os deterministically and in only 
0(^ log„^ I/Os probabilistically. The interesting connection between the par- 
allel domain and the EM domain suggests that there may be relationships be- 
tween computational complexity classes related to parallel computing (such as 
P-complete problems) and those related to I/O efficiency. 

7 Dynamic Multiway Tree Data Structures 

Tree-based data structures arise naturally in the dynamic online setting, in which 
the data can be updated and queries must be processed immediately. Binary 
trees have a host of applications in the RAM model. In order to exploit block 
transfer, trees in external memory generally use a block for each node, which 
can store 0{B) pointers and data values. A tree of degree B‘^, for fixed c > 0, 
with n leaf nodes has height [clog^n]. The well-known B-tree due to Bayer 
and McCreight [23,69] is a balanced multiway tree with height ~ log^ in 
which each node has degree 0{B). It supports dynamic dictionary operations 
and one-dimensional range queries in 0(log^ n+t) I/Os per operation. Persistent 
versions of B-trees have been developed by Becker et al. [24] and Varman and 
Verma [101]. Lomet and Salzberg [77] explore mechanisms to add concurrency 
and recovery to B-trees. 

Arge and Vitter [19] give a useful variant of B-trees called weight-balanced B- 
trees with the property that the number of data items in any subtree of height h 
is 0{a^), for some fixed parameter a of order B. (By contrast, the sizes of 
subtrees at level in a regular B-tree can differ by a multiplicative factor that is 
exponential in h.) When a node on level h gets rebalanced, no further rebalancing 
is needed until its subtree is updated I7(a^) times. Weight-balanced B-trees 
were originally developed as part of an optimal dynamic EM data structure for 
stabbing queries (interval trees) and segment trees, but they can also be used to 
get improvements for algorithms developed in the RAM model [19,54]. 

Gross! and Italiano [55] develop a multidimensional version of B-trees, called 
cross trees, that combine the data-driven partitioning of B-trees at the upper 
levels of the tree with the space-driven partitioning of methods like quad trees 
at the lower levels of the tree. The data structure also supports the dynamic 
operations of split and concatenate. 

7.1 Buffer Trees 

Many batched problems in computational geometry can be solved by plane 
sweep techniques. For example, in Section 5 we showed how to compute or- 
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thogonal segment intersections by keeping track of the active vertical segments 
during the sweep process. If we use a B-tree to store the active vertical segments, 
each insertion and query uses O(log^n) I/Os, resulting in a huge I/O bound of 
O^Nlog^n), which is more than B times larger than the desired bound. One 
solution suggested in [105] is to use a binary tree in which items are pushed 
lazily down the tree in blocks of B items at a time. The binary nature of the 
tree results in a data structure of height ~ log 2 n, yielding a total I/O bound of 
0(n log 2 n), which is still nonoptimal by a significant logm factor. 

Arge [9] developed the elegant buffer tree data structure to support hatched 
dynamic operations such as in this example, where the queries do not have to 
be answered right away or in any particular order. The buffer tree is a balanced 
multiway tree, but with degree 0{m). Its key distinguishing feature is that each 
node has a buffer that can store M items. Items in a node are not pushed down 
to the children until the buffer fills. Emptying the buffer requires 0{m) I/Os, 
which amortizes the cost of distributing the items to the respective children. 
Each item incurs an amortized cost of 1/B I/Os per level. Buffer trees can be 
used as a subroutine to the standard sweep line algorithm to get an optimal 
EM algorithm for orthogonal segment intersection. Arge showed how to extend 
buffer trees to implement external memory segment trees by reducing the node 
degrees to 0{ffm) and by introducing “multislabs” in each node. 

Buffer trees provide a natural amortized implementation of priority queues 
for use in applications like discrete event simulation, sweeping, and list ranking. 
Brodal and Katajainen [27] develop a worst-case optimal priority queue, in the 
sense that every sequence of B insert and deletejmin operations require only 
0(log„ n) I/Os. 

7.2 R-trees 

The R-tree of Guttman [57] and its many variants are an elegant generalization 
of the B-tree for storing geometric objects. Each node in the tree has associated 
with it a bounding box (or bounding polygon) of all the elements in its subtree. 
The bounding boxes of sibling nodes can overlap. If the tree is being used for 
point location, for example, a point may lie within the bounding box of several 
children of the current node in the search. In that case, the search must proceed 
to all such children. 

Several heuristics for where to insert new items into the R-tree and how to 
rebalance have been proposed. Extensive surveys appear in [53,49]. The R*-tree 
of Beckmann et al. [25] seems to give best overall query performance. Con- 
structing an R*-tree by repeated insertions, however, is extremely slow. A faster 
alternative is to use the Hilbert R-tree of Kamel and Faloutsos [62,63]. Each 
item is labeled with the position of its center on the Hilbert space-filling curve, 
and a B-tree is built on the totally ordered labels. Bulk loading a Hilbert R- 
tree is therefore easy once the center points are presorted, but the quality of 
the Hilbert R-tree in terms of query performance is not as good as that of an 
R*-tree, especially for higher-dimensional data [26,64]. 

Arge et al. [13] and Bercken et al. [99] have independently devised fast bulk 
loading methods for R*-trees that are based upon buffer trees. The former 
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method is especially efficient and can even support dynamic batched updates 
and queries. Experiments with this technique are discussed in Section 10. 

8 Range Searching 

Multidimensional range search is a fundamental primitive in several large ge- 
ometric applications and it provides indexing support for new constraint data 
models and object-oriented data models. (See [66] for background.) 

Range searching in a batched setting has been discussed in Section 5, so 
in this section we concentrate on the important online case. R-trees and the 
methods mentioned in Section 7.2 perform well in many typical cases but have 
no worst-case bounds. Unfortunately, for many search problems it is very difficult 
to develop theoretically optimal algorithms; many open problems remain. The 
primary theoretical challenges are four- fold: 

1. to get a query complexity close to 0(log3 n) I/Os, 

2. to get an output complexity of 0(z) I/Os, where 2 : = \Z/B'\ and Z is the 

output size, 

3. to use close to linear disk storage space, and 

4. to support dynamic updates. 

Arge and Vitter [19] design an interval tree based upon the weight-balanced 
B-tree that meets all four goals. It solves the problems of stabbing queries and 
dynamic interval management, following up on earlier work by Kanellakis et 
al. [66]. The EM interval tree is used by [32] to extract at query time the bound- 
ary components of the isosurface (or contour) of a surface. A data structure for 
a related problem, which has optimal output complexity, appears in [4]. 

Ramaswamy and Subramanian [85] introduce the notion of path caching 
to develop EM algorithms for two-sided and three-sided 2-D range queries. 
Their algorithms are optimal in criteria 1 and 2 but with higher storage over- 
heads and amortized and/or nonoptimal update bounds. Subramanian and 
Ramaswamy [85] present the P-range tree data structure for the three-sided 
and (general) four-sided 2-D range queries. The update times are amortized 
and slightly nonoptimal, and the space overhead for four-sided queries is 
0(n(logn)/loglogs n). 

On the other hand, Subramanian and Ramaswamy [85] prove for an external 
memory version of the pointer machine model that no EM algorithm can achieve 
both criteria 1 and 2 using less than 0(n(log n) / log log^ n) space, and thus their 
algorithm is near-optimal. Hellerstein et al. [58] consider a generalization of the 
layout-based lower bound argument of Kanellakis et al. [66] for studying the 
tradeoff between disk space overhead and query performance; the model does not 
distinguish, however, between query complexity and output complexity. Further 
work appears in [70]. 

When the data structure is restricted to contain only a single copy of each 
item, Kanth and Sing [67] show for a restricted class of index-based trees that 
range queries in the worst case require -I- z) I/Os; a matching upper 

bound is provided by the cross tree data structure [55] of Section 7. 
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Fundamentally different techniques are needed for higher dimensions and 
for non-orthogonal queries. Vengroff and Vitter [103] use a partitioning ap- 
proach and compressed representation to get the first EM algorithms for three- 
dimensional searching that are within a poly log factor of optimal. Agarwal et 
al. [3] give near-optimal bounds for halfspace range searching in two dimen- 
sions and some variants in higher dimensions using another type of partitioning 
structure. 

Callahan et al. [28] develop a dynamic EM data structure for use in several 
online problems such as finding an approximately nearest neighbor and main- 
taining the closest pair of vertices. Numerous other data structures have been 
developed for range queries and related problems on spatial data. We refer to [78, 
5] for a survey. 

9 String Processing 

Digital trie-based structures, in which branching decisions at each node are made 
based upon the values of particular bits in strings, are effective for string pro- 
cessing in internal memory. In EM applications, what is needed is a multiway 
digital structure. Unfortunately, if the strings are long, there is no space to store 
them completely in each node, and if pointers to strings are used, the number 
of I/Os per node access will be large. 

Ferragina and Gross! [46, 47] develop an elegant generalization of the B-tree, 
called the SB-tree, for storing strings. The key problem they address is how to 
represent a single node that does 5-way branching. In a B-tree, 5 — 1 items 
are stored in the node to guide the searching, and each item is assumed to be 
unit-sized. However, strings can occupy many characters, so there is not enough 
space to store 5—1 strings in each node. Pointers to the 5—1 strings could 
be stored, but access to them during search would require more than a constant 
number of I/Os. 

Their solution, called the blind trie, is based on the data structure of Ajtai 
et al. [8] that achieves 5-way branching with a total storage of 0(5) characters. 
The resulting query time to search in an SB-tree for a string of length £ is 
0(log^ n+£fB), which is optimal. Ferragina and Gross! apply SB-trees to string 
matching, prefix search, and substring search. Farach and Ferragina [44] show 
how to construct SB-trees, suffix trees, and suffix arrays on strings of length N 
using 0(n log„ n) I/Os, which is optimal. Glark and Munro [33] give an alternate 
approach to suffix trees. 

Arge et al. [12] consider several models for the problem of sorting K strings of 
total length N in external memory. They develop efficient sorting algorithms in 
these models, making use of the SB-tree, buffer tree techniques, and a simplified 
version of the SB-tree for merging called the lazy trie. They show somewhat 
counterintuitively that for sorting short strings (i.e., strings whose length is less 
than 5) the complexity depends upon the total number of characters, whereas 
for long strings the complexity depends upon the total number of strings. 
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10 Empirical Comparisons 

In this section we examine the empirical performance of algorithms for two prob- 
lems that arise in spatial databases. TPIE (Transparent Parallel I/O program- 
ming Environment) [102, 104] is used as the common implementation platform.^ 
TPIE is a comprehensive software package that helps programmers develop high- 
level, portable, and efficient implementations of EM algorithms. 

In the first experiment, three algorithms are implemented in TPIE for the 
problem of rectangle intersection, which is often the first step in a spatial 
join computation. The first method, called Scalable Sweeping-Based Spatial 
Join (SSSJ) [16], is a robust new algorithm based upon the distribution sweep 
paradigm of Section 5. The other two methods are Partition-Based Spatial-Merge 
(QPBSM) used in Paradise [84] and a new modification called MPBSM that uses 
an improved dynamic data structure for intervals [16]. 

The algorithms were tested on several data sets. The timing results for the 
two data sets in Figures 2(a) and 2(b) are given in Figures 2(c) and 2(d), re- 
spectively. The first data set is the worst case for sweep line algorithms; a large 
fraction of the line segments in the file are active (i.e., they intersect the cur- 
rent sweep line). The second data set is a best case for sweep line algorithms. 
The two PBSM algorithms have the disadvantage of making extra copies. SSSJ 
shows considerable improvement over the PBSM-based methods. On more typ- 
ical data, such as TIGER/line road data, experiments indicate that SSSJ and 
MPBSM run about 30% faster than QPBSM. 

In the second experiment, three methods for building R-trees are evalu- 
ated in terms of their bulk loading time and the resulting query performance. 
The three methods tested are a newly developed buffer R-tree method [13] 
(labeled “buffer”), the naive sequential method for construction into R*-trees 
(labeled “naive”), and the best update algorithm for Hilbert R-trees (labeled 
“Hilbert”) [64]. 

The experimental data came from TIGER/line road data sets from four U.S. 
states. The starting point in the experiment was a single R-tree for each of the 
four states, containing 75% of the road data objects for that state. Using each 
of the three algorithms, the remaining 25% of the objects were inserted into the 
R-tree, and the construction time was measured. The query performance of each 
resulting R-tree was tested by posing rectangle intersection queries, using rect- 
angles taken from TIGER hydrographic data. The results in Table 2 show that 
the buffer R-tree has faster construction time than the Hilbert R-tree (the previ- 
ous best method for construction time) and similar or better query performance 
than repeated insertions (the previous best method for query performance). 

Other recent experiments involving the paradigms discussed in this paper 
appear in [31, 61]. 



^ The TPIE software distribution is available at no charge on the World Wide Web 
at http://www.cs.dnke.edu/TPIE/. 
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Number of rectangles 



Fig. 2. Comparison of Scalable Sweeping-Based Spatial Join (SSSJ) with the original 
PBSM (QPBSM) and a new variant (MPBSM) (a) Data set 1 consists of tall and skinny 
(vertically aligned) rectangles, (b) Data set 2 consists of short and wide (horizontally 
aligned) rectangles, (c) Running times on data set 1. (d) Running times on data set 2. 

11 Dynamic Memory Allocation 

It is often the case that operating systems dynamically change the amount of 
memory allocated to a program. The algorithms in the earlier sections assume a 
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Data 


Update 


Update with 25% of the data 


Set 


Method 


Building 


Querying 


Packing 




naive 


259, 263 


6,670 


64% 


RI 


Hilbert 


15,865 


7,262 


92% 




buffer 


13,484 


5,485 


90% 




naive 


805, 749 


40,910 


66% 


CT 


Hilbert 


51,086 


40, 593 


92% 




buffer 


42, 774 


37, 798 


90% 




naive 


1,777,570 


70, 830 


66% 


NJ 


Hilbert 


120,034 


69, 798 


92% 




buffer 


101,017 


65, 898 


91% 




naive 


3, 736, 601 


224, 039 


66% 


NY 


Hilbert 


246, 466 


230, 990 


92% 




buffer 


206,921 


227, 559 


90% 



Table 2. Summary of the I/O costs for R-tree update. 



fixed memory allocation; they must make use of virtual memory if the memory 
allocation is reduced, often causing a severe performance hit. 

Barve and Vitter [22] discuss the design and analysis of EM algorithms that 
adapt gracefully to changing memory allocations. In their model, without loss of 
generality, a program V is allocated memory in phases: During the ith phase, V 
is allocated rrii blocks of internal memory, and this memory remains allocated 
to V until V completes 2mi I/O operations, at which point the next phase begins. 
The process continues until V finishes execution. The duration for each memory 
allocation phase is long enough to allow all the memory in that phase to be used. 

For sorting, the lower bound approach of (6) implies that 2mi log rrii = 

O(nlogn). We say that V is dynamically optimal for sorting if 2mi logrrii = 

O(nlogn) for all possible sequences mi, m 2 , ... of memory allocation. Intu- 
itively, if V is dynamically optimal, no other program can perform more than a 
constant number of sorts in the worst-case for the same allocation sequence. 

Barve and Vitter define the model in generality and give dynamically op- 
timal strategies for sorting, matrix multiplication, and buffer trees operations. 
Their work represents the first theoretical model of dynamic allocation for EM 
algorithms. Pang et al. [83] and Zhang and Larson [113] give memory-adaptive 
merge sort algorithms, but their algorithms handle only special cases and can 
be made to perform poorly for certain patterns of memory allocation. 

12 Conclusions 

In this paper we have described several useful paradigms for the design and 
implementation of efficient external memory algorithms. The problem domains 
we have considered include sorting, permuting, EFT, scientific computing, com- 
putational geometry, graphs, databases, geographic information systems, and 
text and string processing. Algorithmic challenges remain in virtually all these 
problem domains. One example is the design and analysis of new algorithms 
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for efficient use of multiple disks. Optimal bounds are not yet determined for 
basic graph problems like topological sorting, shortest paths, breadth-first and 
depth-first search, and connectivity. There is an interesting connection between 
problems that have good I/O speedups and problems that have fast and work- 
efficient parallel algorithms. 

For some of the problems that can be solved optimally up to a constant fac- 
tor, the hidden constant is too large for the algorithm to be of use in practice, 
and simpler approaches are needed. A continuing goal is to develop optimal EM 
algorithms and to translate theoretical gains into observable improvements in 
practice. New architectures such as networks of workstations and hierarchical 
storage devices present interesting challenges and opportunities. Work is begin- 
ning, for example, on extensions of TPIE to such domains. The ability to adapt 
to changing memory allocations may be important for the use of EM algorithms 
in practice. 
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Abstract. Past research in theoretical computer science has focused 
mainly on static computation problems, where the input is known be- 
fore the start of the computation and the goal is to minimize the number 
of steps till termination with a correct output. Many important processes 
in today’s computing are dynamic processes, whereby input is continu- 
ously injected to the system, and the algorithm is measured by its long 
term, steady state, performance. Examples of dynamic processes include 
communication protocols, memory management tools, and time sharing 
policies. Our goal is to develop new tools for the design and analyzing 
the performance of dynamic processes, in particular through modeling 
the dynamic process as an infinite stochastic processes. 



1 Introduction 

Rigorous analysis of the dynamic performance of computer processes is one of 
the most challenging current goals in theory of computation. Past research in 
theoretical computer science has focused mainly on static computation problems. 
In static computation the input is known at the start of the computation and the 
goal is to minimize the number of steps till the process terminates with a correct 
output. Many important processes in today’s computing are dynamic processes, 
whereby input is continuously injected to the system, and the algorithm (which 
is not supposed to terminate at all) is measured by its long term (steady state) 
performance. Examples of dynamic computer processes include: 

— Contention resolution protocols. 

— Routing and communication algorithms. 

— Memory management tools such as cashing and paging. 

— Time sharing and load balancing protocols. 



1.1 Performance measures for dynamic processes 

In evaluating the performance of dynamic algorithms one needs to distinguish 
between algorithms that satisfy all incoming requests and algorithms that may 
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drop requests. In the first case the primary performance measure of interest is 
the algorithm’s stability conditions. Roughly speaking, a system is stable if in 
the long run the number of new arriving requests is no larger than the number of 
requests processed by the system. The goal of the analysis here is to characterize 
the (most general) input conditions (deterministic or stochastic) under which 
the system is stable. The corresponding measure for algorithms that may drop 
requests is the number of requests (as a function of the incoming stream of 
requests) that the algorithm successfully satisfied (or equivalently, the number of 
requests that the algorithm needs to reject in order to keep the system stable). A 
second criteria, which is important for both type of algorithms is speed, measured 
by the (maximum or expected) time to satisfy a request. 

An algorithm is usually analyze in terms of its (worst case or average) per- 
formance with respect to a given class of inputs. However, in many dynamic 
algorithms it is useful to study instead the competitive ratio performance of the 
algorithm. Competitive ratio analysis compares the performance of the dynamic 
(on-line) algorithm to the performance of an “off-line” algorithm that “knows” 
in advance the whole sequence of requests. The quality of an on-line algorithm 
is measured by the maximum, over all possible input sequences, of the ratio 
between its performance and the performance of an off-line on the same input 
sequence. Thus, competitive analysis measures the quality of the algorithm with 
respect to the optimal one, rather then measuring the actual performance of 
the system. This measure is particular useful in cases when no algorithm can 
perform well on all instances of the set of inputs. 

1.2 Stochastic analysis 

As in the case of static algorithms, randomness is introduced into dynamic com- 
putation through either the algorithm, the input or both. Many interesting dy- 
namic protocols are random, a well known example is the Aloha contention reso- 
lution protocol which is used in the Ethernet and other similar applications. 
An execution of a dynamic random algorithm, even on a fixed input sequence, 
defines an infinite stochastic process in which a state at a given step depends on 
the history of the process. Analysis of such a process requires different approach 
and tools from the ones used for analyzing the finite execution of a randomized 
static computation. 

Worst case analysis rarely gives an interesting insight on the actual per- 
formance of a dynamic algorithm. A worst case adversary can generate ex- 
tremely hard sequences of requests, and the performance of the algorithm on 
these “pathological” cases does not accurately represent the efficiency of the al- 
gorithm. To offset the affect of rare cases it is useful to analyze the performance 
of dynamic systems under some stochastic assumptions on the stream of inputs. 
Such assumptions are more realistic in the dynamic setting, in particular when 
requests are originated by a number of independent processors, than in the case 
of static analysis. The stochastic process that control the stream of requests 
might be stationary, periodic, or even bursty. The goal is to obtain results that 
are valid under the weakest set of assumptions. The advantage and practicality 
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of this approach has been well demonstrated by the achievements of queueing 
theory Our goal is to applied similar techniques to dynamic computer pro- 
cesses that do not fit the queueing theory settings. Competitive ratio analysis is 
another alternative to standard worst case analysis. However, even in the case 
of competitive analysis, stochastic input may lead to better evaluation of the 
algorithm’s performance. One example is greedy load balancing. This procedure 
is very efficient in practice, while its worst case competitive ratio is high. On 
the other hand it can be shown that under some stochastic assumptions the 
average competitive ratio of greedy load balancing is very low Q, thus giving 
one possible explanation for the “real-life” performance of this procedure. 

Stochastic analysis of dynamic processes builds on the rich theory of stochas- 
tic processes, in particular queueing theory, and the theory of stationary pro- 
cesses. However, in many cases one needs to develop new tools to address the 
specific problems pose by computer related processes, which are discrete and 
involved complicated dependency conditions. 



2 Dynamic Balanced Allocations 

In Q we studied a surprising combinatorial phenomena which we term balanced 
allocations. In the dynamic version of balanced allocation we have n boxes and 
n balls. In each step one ball, chosen uniformly at random, is removed from the 
system, and a new ball appears. The new ball comes with d possible destinations, 
chosen independently at random, and it is placed into the least full box, at the 
time of the placement, among the d possible destinations. 

The case d = 1 corresponds to the classic occupancy problem. It is easy 
to show that in the stationary distribution, with high probability the fullest 
box has 0(logn/loglogn) balls. The analysis of the case d > 2 is significantly 
harder, since the locations of the current n balls might depend on the locations 
of balls that are no longer in the system. A surprising result which we proved 
in Q shows that for d > 2, in the stationary distribution, with high probability 
no box contains more than In Inn/ In d -I- 0(1) balls. Thus, an apparent minor 
change in the random allocation process results in an exponential decrease in 
the maximum occupancy per location. Several recent works have built on our 
original result. Mitzenmacher studied a continuous time variant of our model 
in which balls arrive according to a Poisson process and are removed with an 
exponential distribution. Czumaj and Stemann showed that a more efficient 
variant of our placement procedure results in similar maximum load. 

The dynamic balanced allocation process has a number of algorithmic appli- 
cations: 

Dynamic dictionary. The efficiency of a hashing technique is measured 
by two parameters: the expected and the maximum access time. Our approach 
suggests a simple hashing technique, similar to hashing with chaining. We call 
it 2-way chaining. It has 0(1) expected, and O(loglogn) maximum access time. 
We use two random hash functions. The two hash functions define two possible 
entries in the table for each key. The key is inserted to the least full location, at 
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the time of the insertion. Keys in each entry of the table are stored in a linked 
list. The expected insertion and look-up time is 0(1), and with high probability 
the maximum access time is In n Inn/ In 2 -h 0(1), versus the 0(logn/loglogn) 
time when one random hash function is used. 

Dynamic Resource Allocation. Consider a scenario in which a user or a 
process has to choose on-line between a number of identical resources (choosing a 
server to use among the servers in a network; choosing a disk to store a directory; 
etc.). To find the least loaded resource, users may check the load on all resources 
before placing their requests. This process is expensive, since it requires sending 
an interrupt to each of the resources. A second approach is to send the task 
to a random resource. This approach has minimum overhead, but if all users 
follow it, the difference in load between different servers will vary by up to a 
logarithmic factor. The balanced allocation process suggests a more efficient 
solution. If each user samples the load of two resources and sends his request 
to the least loaded, the total overhead is small, and the load on the n resources 
varies by only a O(loglogn) factor, an exponential improvement over the pure 
random allocation. 

3 Dynamic Flow Control for Packet Routing With 
Bounded Buffers 

Most theoretical work on routing algorithms has focused on static routing: A 
set of packets is injected into the system at time 0, and the routing algorithm 
is measured by the time it takes to deliver all the packets to their destinations, 
assuming that no new packets are injected in the meantime (see Leighton 
for an extensive survey). In practice however, networks are rarely used in this 
“batch” mode. Most real-life networks operate in a dynamic mode whereby new 
packets are continuously injected into the system. Each processor usually con- 
trols only the rate at which it injects its own packets and has only a limited 
knowledge of the global state. This situation is better modeled by a stochastic 
paradigm whereby packets are continuously injected according to some inter- 
arrival distribution, and the routing algorithm is evaluated according to its long 
term behavior, goal is to develop algorithms that perform close to optimal for a 
variety of inter-arrival distributions. 

Several recent articles have addressed the dynamic routing problem, in the 
context of packet routing on arrays on the hypercube and the but- 

terfly and general networks Except for Q, the analyzes in these works 
assumes a Poisson arrival distribution and requires unbounded queues in the 
routing switches (though some works give a high probability bound on the size 
of the queue used ^^^J) . Unbounded queues allow the application of some tools 
from queuing theory (see ^^^]) and help reduce the correlation between events 
in the system, thus simplifying the analysis at the cost of a less realistic model. 

In Q we focused on the design and analysis of efficient flow control mechanism 
for dynamic packet routing in networks with bounded buffers at the switching 
nodes, a setting that more accurately models real networks. Our goal was to 
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build on the vast amount of work that has been done for static routing in order 
to obtain results for the dynamic setting. In Q we developed a general technique 
in which a static algorithm for a network with bounded buffers is augmented with 
a simple flow control mechanism to obtain a provably efficient dynamic routing 
algorithm on a similar network with bounded buffer. The crucial step in |~| is a 
general theorem showing that any communication scheme (a routing algorithm 
and a network) that satisfies a given set of conditions, defined only with respect 
to a finite history is stable up to a certain inter-arrival rate. The theorem also 
bounds the expected routing time under these conditions. 

This technique was applied in | to a number of dynamic routing scenarios 
on the butterfly network. In particular we gave the first provable routing protocol 
for the butterfly network with bounded buffer that is stable for injection rate 
that is within a constant factor of the hardware bandwidth. Similar results are 
obtained in for dynamic routing on the mesh. Another result in shows 
that our general technique can be applied to the adversarial input model of 
y. In that model, instead of probabilistic assumptions on the input there is an 
absolute bound on the number of packets generated in any time interval and 
must traverse any particular edge. Our technique gives a dynamic randomized 
routing algorithm for a butterfly with bounded buffer that is optimal (up to 
constant factors) in that model Q. 

4 Dynamic Virtual Circuit Allocation 

Communication protocols for high-speed high bandwidth networks (such as the 
ATM protocol) are based on virtual circuit switching. The speed of the network 
does not allow for on-line routing of individual packets. Instead, upon estab- 
lishing a connection, bandwidth is allocated along a path connecting the two 
endpoints for the duration of the connection. This “virtual circuits” are set up 
on a per-call basis and are disconnected when the call is terminated. Efficient 
utilization of the network depends on the allocation of virtual circuits between 
pairs of nodes so that no link is overloaded beyond its capacity. 

As in other routing problems we distinguish between a static and a dynamic 
version. In the static version all the requests are given at once and must be si- 
multaneously satisfied. In practice network are rarely used in the “batch mode” 
modeled by the static problem. Real-life network performance is better modeled 
by a dynamic process whereby requests for connection are continuously arriving 
at the nodes of the network. A connection has a duration time, and once the 
communication has terminated its bandwidth can be used for another connec- 
tion. A dynamic circuit switching problem is thus characterized by a number 
of parameters: the network topology, the channels physical bandwidth, the in- 
jection rate distribution of new requests and the distribution of the duration of 
connections. 

Using a random walk approach we develop in |M| a simple and fully dis- 
tributed protocol for dynamic path selection on bounds degree expander graphs. 
For the analysis we adopt the stochastic model assumed in the design of most 
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long-distance telephone networks Requests arrives according to a Poisson 
process, and the duration of a connection is exponentially distributed. Our so- 
lution achieves an almost optimal utilization of the edges under this stochastic 
model. 

Out approach differs from the work on admission control in that we 

do not reject requests. All requests are eventually satisfied in our model, but 
not immediately. In contrast, in the admission control model a request is either 
immediately satisfied or it is rejected. Our approach better models most types 
of computer communication, while the admission control approach is a better 
model for human (telephone) communication. 



5 Stochastic Contention Resolution with Short Delays 

One of the few processes that has been studied in the past in dynamic settings 
are contention resolution protocols and in particular the backoff based solutions. 

The contention resolution problem is best defined in terms of multiple access 
channels. There are n senders and m receivers, at each of a series of time steps, 
one or more senders may generate a packet. A packet when generated has a 
destination — a unique receiver to which it must be delivered. Any sender may 
attempt to send a packet to any receiver at any step, but a receiver may only 
receive one packet in a step. If a receiver is sent more than one packet in a step 
(a collision), all packets sent to that receiver are lost and the senders notified 
of the loss. The senders must then try to send these packets again at a future 
step. There is no explicit communication between the senders for coordination 
the transmissions; the only information that senders have is the packet(s) they 
have waiting for transmission, and the history of losses. A packet can only be 
transmitted directly from its sender to its receiver; intermediate hops are dis- 
allowed. The case m = 1 is a classical instance of sharing a common resource 
such as a bus or an Ethernet channel (the shared bus is modeled by the single 
“receiver”). The case with m = n has been studied recently in the static setting 
under the name Optically Connected Parallel Computer, or OCPC 

Of primary interest, in the dynamic setting, is whether a protocol can sustain 
packet arrivals at some rate without instability. For stable protocols, two quanti- 
ties are of interest: (1) the maximum arrival rate A that can be sustained stably 
(measured as a fraction of the hardware capacity), and (2) the delay, defined 
to be the maximum over all senders of the expected number of steps from the 
generation of a packet to its delivery, in the steady state. Delay is of particular 
importance in high-speed communications applications such as video and ATN 
networks. 

Most previous analysis focused on backoff protocols. The binary exponential 
backoff protocol used in the Ethernet was proposed by Metcalfe and Boggs 
Aldous Q showed that for any positive constant A, binary exponential backoff is 
unstable if the number of senders is infinite. Kelly showed that any polyno- 
mial backoff protocol is unstable for infinitely many senders. Hastad, Leighton 
and Rogoff^J studied systems with a finite number of senders, they showed that 
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binary exponential backoff is unstable for A slightly larger than 0.567 even for a 
system with a finite number of senders, and that polynomial backoff protocol is 
stable for any A < 1 and for any finite number of senders. 

A major drawback of all the backoff protocols is the long expected delay 
in delivering packets (at least linear in the number of senders) as was shown in 
In we present the first contention resolution protocol with o(n) expected 
delay, our protocol is stable for a constant injection rate and the expected delay 
is logarithmic in the number of senders. Two recent works have build on 

our results to obtain constant expected delay. 

References 

1. D. Aldous. Ultimate instability of exponential back-off protocol for acknowledg- 
ment based transmission control of random access communication channels. IEEE 
Transactions on Information Theory, Vol. IT-33, 1987, pp. 219-223. 

2. M. Andrew, B. Awerbuch, A Fernandez, J. Kleinberg, T. Leighton, and Z. Liu. 
Universal stability results for greedy contention-resolution protocols. Proceedings 
of the 37th Annual Synp. on Foundations of Computer Science, pages 380-389, 
1997. 

3. B. Awerbuch, Y. Azar, and S. Plotkin. Throughput competitive online routing. 
In Proceedings of the 34th IEEE Conference on Foundations of Computer Science, 
pages 32-40, 1993. 

4. B. Awerbuch, R. Gawlick, T. Leighton, and Y. Rabani. On-line admission control 
and circuit routing for high-performance computing and communication. Proceed- 
ings of the 35th Annual Synp. on Foundations of Computer Science, October 1994, 
pp 412-423. 

5. Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. In 
Proceedings of the 26th Annual ACM Symposium on Theory of Computing, pages 
593-602, 1994. 

6. Y. Azar, J. Naor, and R. Rom. The competitiveness of on-line assignments. In 
Proceedings of the 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, 
pages 203-210, 1992. 

7. A. Borodin, J. Kleinberg, P. Raghavan, M. Sudan and D. P. Williamson Adversarial 
queuing theory. Proceedings of the 28th Annual ACM Symposium on Theory of 
Computing, pp. 376-385, 1996. 

8. A. Z. Broder and E. Upfal. Dynamic Deflection Routing in Arrays. Proceedings of 
the 28th Annual ACM Symposium on Theory of Computing, pp. 348-355, 1996. 

9. A.Z. Broder, A.M. Frieze, and E. Upfal. “A general approach to dynamic packet 
routing with bounded buffers.” Proceedings of the 37th IEEE Symp. on Founda- 
tions of Computer Science. Burlington, 1996, pp. 390-399. 

10. A.Z. Broder, A.M. Frieze, and E. Upfal. “Static and dynamic path selection on 
expander graphs: a random walk approach”. Proceedings of the 29th ACM Symp. 
on Theory of Computing. El Paso, 1997. 

11. A. Broder, A. Frieze, and E. Upfal. “Dynamic packet routing on arrays with 
bounded buffers”. Third Latin American Symposium on Theoretical Informatics 
- LATIN ’98 Campinas, Brazil. April 1998. In Springer-Verlag Lecture Notes in 
Computer Science 1380, pp 273-281, 1998. 

12. R.L. Cruz. A calculus for network delay, part I: Network elements in isolation. 
IEEE Trans, on Information Theory, Vol. 37, pages 114-131, 1991 



Design and Analysis of Dynamic Processes: A Stochastic Approach 



33 



13. R.L. Cruz. A calculus for network delay, part II: Network analysis in isolation. 
IEEE Trans, on Information Theory, Vol. 37, pages 132-141, 1991 

14. A. Czumaj and V. Stemann. “Randomized Allocation Processes”. Preprint, 1997. 

15. M. Dietzfelbinger, A. R. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohn- 
ert, and R. E. Tarjan. Dynamic perfect hashing: Upper and lower bounds. In 
Proceedings of the 29th IEEE Conference on Foundations of Computer Science, 
pages 524-531, 1988. 

16. L.A. Goldberg and P.D. MacKenzie. Contention Resolution with Guaranteed Con- 
stant Expected Delay. Preprint, 1997. 

17. J. Goodman, A.G. Greenberg, N. Madras, and P. March. Stability of Binary 
Exponential Backoff. J. of the ACM, Vol. 35 (1988) pages 579-602. 

18. A.G. Greenberg, P. Flajolet, and R.E. Ladner. Estimating the Multiplicities of 
Conflicts to Speed Their Resolution in Multiple Access Channels. J. of the ACM, 
Vol. 34, 1987, pp. 289-325. 

19. J. Hastad, T. Leighton, and B. Rogoff. Analysis of backoff protocols for multiple 
access channels. Proceedings of the 19th ACM Symp. on Theory of Computing, 
1987, pp. 241-253. 

20. M. Harcol-Balter and P. Black. Queuing analysis of oblivious packet routing net- 
works. Procs. of the 5th Annual ACM-SIAM Symp. on Discrete Algorithms. Pages 
583-592, 1994. 

21. M. Harcol-Balter and D. Wolf. Bounding delays in packet-routing networks. 
Procs. of the 27th Annual ACM Symp. on Theory of Computing, 1995, pp. 248-257. 

22. R. M. Karp, M. Luby, and F. Meyer auf der Heide. Efficient PRAM simulation on a 
distributed memory machine. In Proceedings of the 2fth Annual ACM Symposium 
on Theory of Computing, pages 318-326, 1992. 

23. A. Kamath, O. Palmon, and S. Plotkin. Routing and Admission Control in General 
Topology Networks with Poisson Arrivals. SODA 96. 

24. F.P. Kelley. Blocking probabilities in large circuit-switching networks. Adv. Appl. 
Prob., 18:573-505, 1986. 

25. F.P. Kelly. Stochastic models of computer communication systems. J. Royal Sta- 
tistical Soc. B, Vol. 47, 1985, pp. 379-395. 

26. L. Kleinrock. Queueing systems. Wiley, New York, 1975. 

27. N. Kahale and T. Leighton. Greedy dynamic routing on arrays. Procs. of the 6th 
Annual ACM-SIAM Symp. on Discrete Algorithms. Pages 558-566, 1995. 

28. T. Leighton. Average case analysis of greedy routing algorithms on arrays. Procs. of 
the Second Annual ACM Symp. on Parallel Algorithms and Architectures. Pages 
2-10, 1990. 

29. F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan- 
Kaufmann, San Mateo, CA 1992. 

30. F.T. Leighton and S. Rao. Circuit switching: a multi-commodity flow based ap- 
proach. Proc. of 9th International Parallel Processing Symposium, 1995. 

31. P.D. MacKenzie, C.G. Plaxton, and R. Rajaraman. On contention resolution pro- 
tocols and associated probabilistic phenomena. Proceedings of the 26th ACM 
Symposium on Theory of Computing, 1994, pp. 153-162. 

32. R. Metcalfe and D. Boggs. Ethernet: distributed packet switching for local com- 
puter networks. Communication of the ACM, Vol. 19, 1976, pp. 395-404. 

33. M. Mitzenmacher. Bounds on the greedy algorithms for array networks. Procs. of 
the 6th Annual ACM Symp. on Parallel Algorithms and Architectures. Pages 346- 
353, 1994. 




34 



E. Upfal 



34. M. Mitzenmacher. Load balancing and density dependent jump Markov processes. 
Procs. of the 37th IEEE Annual Symp. on Foundations of Computer Science, pages 
213-222, October 1996. 

35. M. Paterson and A. Srinivasan. Contention resolution with bounded delay. Proc. of 
the 34 th Annual IEEE Symp. on Foundation of Computer Science, pages 104-113, 
1995. 

36. P. Raghavan and E. Upfal. St9chastic contention resolution with short delays. 
Proc. of 24 th ACM Symp. on Theory of Computing, pages 229-237, 1995. 

37. C. Scheideler and B. Voecking Universal continuous routing strategies. Procs. of 
the 8 th Annual ACM Symp. on Parallel Algorithms and Architectures. 1996. 

38. G. D. Stamoulis and J. N. Tsitsiklis. The efficiency of greedy routing in hypercubes 
and butterflies. Procs. of the 6 th Annual ACM Symp. on Parallel Algorithms and 
Architectures. Pages 346-353, 1994. 




Car-Pooling as a Data Structuring Device: 
The Soft Heap* 



Bernard Chazelle 



Department of Computer Science 
Princeton University 
Princeton, NJ 08544, USA 
chazelleScs .princeton.edu 



Abstract. A simple variant of a priority queue, called a soft heap, is 
introduced. The data structure supports the usual operations: insert, 
delete, meld, and findmin. In order to beat the standard information- 
theoretic bounds, the soft heap allows errors: occasionally, the keys of 
certain items are artificially raised. Given any 0 < e < 1/2 and any 
mixed sequence of n operations, the soft heap ensures that at most en 
keys are raised at any time. The amortized complexity of each operation 
is constant, except for insert, which takes 0(log 1/e) time. The soft heap 
is optimal. Also, being purely pointer-based, no arrays are used and no 
numeric assumptions are made on the keys. The novelty of the data struc- 
ture is that items are moved together in groups, in a data-structuring 
equivalent of “car pooling.” The main application of the data structure 
is a faster deterministic algorithm for minimum spanning trees. 



1 Introduction 

We design a simple variant of a priority queue, called a soft heap. The data 
structure stores items with keys from a totally ordered universe, and supports 
the operations: 

— create (5): Create a new, empty soft heap. 

— insert (5, x): Add new item x to S. 

— meld (5i, 52): Form a new soft heap with the items stored in 5i, S 2 (assumed 
to be disjoint), and destroy Si and S 2 - 

— delete (5, x): Remove item x from S. 

— findmin (5): Return an item in S with minimum key. 

The soft heap may, at any time, increase the value of certain keys. Such keys, 
and by extension, the corresponding items, are called corrupted. Corruption is 
entirely at the discretion of the data structure and the user has no control over it. 
Naturally, f indmin returns the minimum current key, which might be corrupted. 
The benefit is speed: during heap updates, items travel together in packets in a 
form of “car pooling.” 

* This work was supported in part by NSF Grant GGR-93-01254, NSF Grant GGR- 
96-23768, ARO Grant DAAH04-96-1-0181, and NEG Research Institute. 
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Theorem 1. Fix any parameter 0 < e < 1/2, and beginning with no prior data, 
consider a mixed sequence of operations that includes n inserts. There is a soft 
heap such that the amortized complexity of each operation is constant, except for 
insert, which takes 0(log 1/e) time. At most en items are corrupted at any given 
time. In a comparison-based model, these bounds are optimal. 

Note that this does not mean that only en items are corrupted in total. 
Because of deletes, in fact, most items could end up corrupted. If we set e < 
1/n, then the soft heap behaves like a regular heap with logarithmic insertion 
time. The soft heap is purely pointer-based: no arrays are used, and no numeric 
assumptions on the keys are required. In this regard, it is fundamentally different 
from previous work on approximate data structures, eg, Q. The soft heap was 
designed with a specific application in mind, minimum spanning trees: it is the 
main vehicle for an algorithm Q that computes a minimum spanning tree of a 
connected graph in time 0(rn a log a), where a = a(m, n) is a functional inverse 
of Ackermann’s function and n (resp. m) is the number of vertices (resp. edges). 

A variant of the soft heap allows the corruption to be independent of the 
number of insertions. The proof is quite complicated and involves a lengthy 
amortized analysis omitted from this extended abstract. 

Theorem 2. Fix any parameter 0 < e < 1/2, and beginning with no prior data, 
consider a mixed sequence of operations that includes f findmins and d deletes. 
There is a soft heap such that the amortized complexity of each operation is 
constant, except for insert, which takes 0(l/e^) time. At most e{f d) items 
are corrupted at any given time. 

2 The Data Structure 

Recall that a binomial queue Q of rank fc is a rooted tree of 2^ nodes: it is 
formed by the combination of two binomial queues of rank k — 1, where the root 
of one becomes the new child of the other root. Just as a Fibonacci heap [T] is 
a collection of binomial queues, a soft heap is a sequence of modified binomial 
queues, called soft queues. The modifications come in two ways. 

— A soft queue h is a binomial queue with subtrees possibly missing. The 
binomial queue from which it is derived is called the master queue of h. The 
rank of a node of h is its number of children in the master queue. It is an 
upper bound on the number of children in h. 

— A node v may store several items, in fact, a whole item-list, denoted Ly. The 
node key of v, denoted key(u), indicates the value of all the current keys of 
the items in it is an upper bound on the original keys. We fix a parameter 
r = r{e), and we require that all corrupted items be stored at nodes of rank 
greater than r and in item-lists of size at least two. 

The heai| consists of a doubly-linked list si, . . . , s^: each cell Si has two 
extra pointers: one to the root of a distinct queue, and another to the root of 

For brevity we drop the “soft.” 
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2,4 (4) 




Fig. 1. Two binomial queues of rank 2 combine to make one binomial queue of 
rank 3. The soft queue is missing the two light edges. Corrupted item-lists are 
both of size two; node keys are in parentheses. 



minimum key among all Vj’s (j > i). The latter is called the prefix-min pointer 
of Si- We require that rank(ri) < • • • < rank(rm). By extension, the rank of a 
queue (resp. heap) refers to the rank of its root (resp. r^). 



3 The Heap Operations 

To create a new, empty heap requires no work. To insert a new item, we create 
an uncorrupted one-node queue, and we meld it into the heap (see below). We 
delete an item “lazily” by marking it. Finally, the two remaining operations work 
like this: 

~ meld (5i, 52 ): We dismantle the heap of lesser rank, say S 2 , by meld- 
ing each of its queues into 5i . To meld a queue of rank k into 5i , we 
look for the smallest index i such that rank(ri) > fc. If there is no 
such i, we add a new cell past the last one in 5i, with a pointer to 
the queue’s root. If rank(ri) > k, we insert the cell right before Si, 
instead. Otherwise, we meld the two queues into one of rank A: -I- 1, 
by making the root with the larger key a new child of the other root. 

If rank(ri+i) = k + 1, a new conflict arises. We repeat the process as 
long as necessary like carry propagation in binary addition. Finally, 
we update the prefix-min pointers between si and the last cell vis- 
ited. When melding not a single queue but a whole heap, the last 
step can be done at the very end in just one pass through 5i. 

— findmin: The prefix-min pointer at si leads to the smallest current 
key in the heap. The trouble is, all the items at that node rj might 
be marked deleted. In that case, we must discard this item-list and 
refill it with other “live” items. To do that, we call the procedure 
succ(Ai), where h is the queue rooted at rj, and follow suit with 
another findmin. 
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succ (h) 

Lroot{h) <— r <— 0; 

if h has no child 

then set key(root(ft.)) to oo and return; 
1. succ(^); 

if key(root(A)) > key(root(B)) 

then exchange A and B, and clean up; 
T ^ T U L^oot(A)', 

if loop-condition holds then goto 1; 

^root(h) ^ ^ • 



The procedure succ moves items up the queue by grouping them together to 
achieve the “car pooling” effect mentioned earlier. The price to pay for increased 
efficiency is corruption. In the pseudo-code below, B denotes the subtree rooted 
at the highest-ranking child of root(/i), while A refers to h deprived of B and the 
edge leading to it; note that rank(B) might be much smaller than rank(ft,). The 
“loop-condition” statement is what makes soft heaps special. Without it, succ 
would be indistinguishable from the standard delete-min operation of a binomial 
queue. 

Here is a line-by-line explanation. The item-list at root(/i) contains only 
deleted items, so it is emptied out. The (local) variable T is initialized for the 
car-pooling about to take place. If h has no child, then having just lost the item- 
list at its root, we assign the root an infinite key in anticipation of its imminent 
removal. Otherwise, we recurse within H, treating it as a queue of same rank as 
B. This provides root(H) with a brand-new item-list (possibly empty) and a new 
node key (possibly oo). If the heap ordering is now violated at root(ft-), we swap 
A and B, ie, we move B to become rooted at root(/i) and finally we exchange 
the names A and B to go back to the original notation. Next, we clean up: if 
the key of root(i?) is now infinite, we discard B from h and from the heap. Note 
that although root(/i) loses a child, its rank remains unchanged. We append the 
new item-list of A to T. 

Next, we identify the new A and B\ recall that B is the subtree rooted at 
the highest-ranking child of root(/i). If no cleanup was necessary then rank(H) 
remains unchanged, else it drops. In the extreme case, h no longer has a child 
after the cleanup: then the new B is not defined. The most interesting instruction 
comes next. The loop-condition holds if it is being tested for the first time in 
the call to succ(ft.) (ie, branching is only binary), if the new B is well defined, 
and if 



0 < rank(H) — r <2 



rank(/i) — r 






These inequalities refer to the old A. If the loop-condition holds, then the next 
time we append L{Al) to T, all items in T will be corrupted as they adopt the 
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new key of root(A). Note that this is the only spot where corruption takes place. 
Finally, if ft, is a bona-fide queue, and not a sub-queue called recursively, we must 
also clean up the heap. If the key of root(ft) is infinite, we discard ft. We also 
update all the prefix-min pointers between si and the cell pointing to ft. Recall 
that even if the root of ft has lost children its rank does not decrease, and so ft 
stays in place within the soft heap. 

4 Fighting Corruption 

To bound the number of corrupted items in the soft heap, we prove that 



Until the first call to succ, all item-lists have size one, and the inequality holds. 
Afterwards, simple inspection shows that all operations have either no effect 
on iQ or sometimes a favorable one (eg, meld). All of them, that is, except for 
succ, which could potentially cause a violation. We show that this is not the 
case. If succ(ft) calls succ(A) only once, the item-list of A migrates to a higher- 
ranking node by itself and by induction Q is satisfied. Otherwise, the item-list 
at root(ft) becomes L^oot(A) U L^oot(A'), and 0 < i — r < 2[{k — r)/2j, where 
£ = rank(A) > rank(A') and k = rank(ft). By induction, the new item-list at ft 
is, therefore, of size at most 2 • which proves Q. 

Lemma 1. At any time during a mixed sequence of operations that includes n 
inserts, the soft heap contains at most corrupted items. 

Proof. We begin with a simple observation. If N is the node set of a binomial 
queue of rank k then (trivial proof by induction), 

^ ^ ^ra,nk.(v) /2 ^ 2^+2 ^ _ ^kj2 ^ 2 ^ 

v£N 

Now, recall that the ranks of the nodes of a queue ft are derived from the cor- 
responding nodes in its master queue ft'. So, the set R (resp. R') of nodes of 
rank at least r in ft (resp. ft') is such that |i?| < |i?'|. Within ft', the nodes of 
R' number a fraction at most 1/2’'“^ of all the leaves. Some of these leaves may 
not appear in ft, but they once existed in the queue, with each one holding a 
distinct item. Summing over all master queues, we find that 



There is no corrupted item at any rank < r, and so by Q their total number 
does not exceed 



\L„\ <max{l,2L(‘-“‘=(’')-’’)/2J}. 



( 1 ) 




( 3 ) 



h' 




h' vGR' 



( 4 ) 
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Each R' forms a binomial queue by itself, where the rank of node v becomes 
rank(?;) — r. Since a binomial queue of rank k has 2^ nodes, then by (QQ, the 
sum in Q at most ^ 

5 The Running Time 

Only meld and succ need to be looked at, all other operations being trivially 
constant-time. Assigning one credit per queue takes care of the carry-like ripples 
during a meld. Indeed, two queues of the same rank combine into one, which 
releases one credit to pay for the work. Updating prefix-min pointers can take 
time, however. Specifically, ripples aside, the cost of melding Si and S 2 is pro- 
portional to the smaller rank of the two. The entire sequence of melds can be 
modeled as a binary tree. A leaf z denotes a heap (its cost is 1). An internal 
node z indicates the melding of two heaps. Since heaps can grow only through 
melds, the size of the resulting heap at z is proportional to the number N{z) of 
descendents. The cost of node z (ie, of the meld) is 1 -I- logmin{ N{x),N{y) }, 
where x and y are the left and right children of z|lt is a staple of algorithm 
analysis Q that adding together all these costs gives a total melding cost linear 
in the size of the tree. 

Finally, we show that the cost of all calls to succ is 0(rn). Consider an 
execution of succ(/i). If root(fi) has at least two children, then after the first 
call succ(A), no cleanup is necessary and the new B is well defined. By looking 
at the computation tree, it follows immediately that, excluding the updating of 
prefix-min pointers, the running time is 0{rC), where C is the number of times 
the loop-condition succeeds. Each time the loop-condition is satisfied, both calls 
succ (A) bring finite node keys to the root and two nonempty item-lists are 
merged into T. There can be at most n — 1 such merges, therefore C < n and 
our claim holds. 

We ignored the cost of updating prefix-min pointers after each call to succ. 
To keep it in 0(n), we must make a few minor modifications to the algorithm. 
We maintain the following invariant on the root of any soft queue: its number 
of children should be at least half its rank. This makes the cost of prefix-min 
updating negligible. Indeed, each update takes 0(rank(/i)) time, which is now 
dominated by the cost of succ(fi) itself. 

Only succ, by removing nodes, can upset the new invariant. To restore it for 
h is easy: we meld back every sub-queue of h whose root is a child of root(/i). If 
the root of h is not corrupted, then we meld it back, too. Otherwise, we set its 
node key to 00 and we store its item-list separately in a reservoir. For example, 
we can add to the list si, . . . , Sm a cell sq pointing to the reservoir. Of course, 
after all the melding is done, we check for the invariant again (which might still 
be violated because of nodes becoming roots) and we repeat as long as necessary. 
Note that melding two soft heaps is done as before; in addition, the two reservoirs 
are linked together in constant time. 

^ We use the fact that the rank is exactly the logarithm of the number of nodes in the 
master queue. 
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Our previous analysis of melds shows that their cost is linear in their number. 
How many new melds do we create now? Consider the missing children of root(/i) : 
in the master queue, at least one of these children is of rank at least [rank(/i)/2j . 
The leaves at or below that child have lost their items: this cannot happen to 
them again, so they can be charged for the melding costs. Their number is at 
least to which is more than enough to account for the (at most) 

rank(/i)/2 new melds. 

By y the number of corrupted items migrating into the reservoir after dis- 
mantling h is at most 2 hank(ft)-r)/ 2 ^ which is at most a fraction 2^“’’/^ the num- 
ber of leaves of the previous paragraph. It follows from LemmaHthat the total 
number of corrupted items is bounded (conservatively) by -|- njT'~^ . 

Setting r = 5 -I- 2|"log(l/e)] proves Theorem ^(except for the optimality claim). 
Remark: The storage is linear in the number of insertions n, but not necessarily 
in the actual number of items present. If this is a problem, here is a simple fix: as 
soon as the number of live items falls below a fixed fraction of n, we dismantle the 
data structure entirely, raise all corrupted keys to infinity, and move their items 
into the reservoir. A charging scheme similar to the one above easily accounts 
for the resulting increase in corruption. This modified soft heap is optimal in 
storage, and as shown below, in time. 



6 Optimality 

To complete the proof of Theorem ^ we show that the soft heap is optimal. 
Consider a sequence of n inserts (with distinct keys), followed by n pairs of the 
form: findmin, delete returned item. Assume that at most en items are corrupted 
at any given time; n is taken large enough, and without loss of generality, 1/e lies 
between a large constant and ^Jn. We also assume that each item knows at which 
time it is corrupted (if at all). We show that in linear time enough information 
can be collected that would take superlinear time to gather from scratch. Divide 
the sequence of n pairs into time intervals Ti , . . . , T; , each consisting of |" 2en] 
pairs; without loss of generality, we may assume that n = \2en\l. Let Si be 
the set of items deleted during T^, and let Ui be the subset of Si that includes 
only items that were at some point uncorrupted during T^. Finally, let Xt be 
the smallest original key among the items oiUi, and let Si be the corresponding 
item; for convenience, we put xq = — oo and xi+\ = oo. Given an item s G St 
whose original key lies in [xj,Xj+i), we define p(s) = \i — j\. Some simple facts: 

(1) \Ui\ > \2en] — en > en: Because at most en items are corrupted at the 
beginning of Ti. 

(2) The Xi’s appear in increasing order, and the original key of any item in Ui 
lies in [xi, Xi+i): Since s^+i is uncorrupted during Ti, its original key Xi+\ is 
at least the current (and hence, the original) key of any item deleted during 
T,. 

(3) P('S) \ s G Si\Ui} < 2n: Given s G Si \ Ui, let [xj, Xj+i) be the interval 
containing the original key of s. As we just observed, the original key of s 
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is less than Xi+\, and therefore, j < i. To avoid being selected by findmin, 
the item s must have been in a corrupted state during the deletion of Xk, 
for any j < k < i {\i any such k exists). The total number of corrupted 
items during the deletions of Xi, ... ,xi is at most enl, and therefore so is 
the sum of distances i — j — 1 = p{s) — 1 over all such items s. It follows that 
P(^) — + n < 2n, hence our claim. 

(4) The number of items whose original keys fall in [xi,Xi+i) is less than Gen: 
Suppose that the item does not belong to Si U S'i+i. It cannot be selected by 
findmin and deleted before the beginning of Tj, since Si was not corrupted 
yet. By the time Si+i was deleted, the item in question must have been 
corrupted (it cannot have been deleted yet). So, there can be at most sn 
such items. Thus, the total number of items with original keys in [xi, Xi+\) 
is at most 2|'2en] + en. 

Next, for each item .s € St\ Ui, we search which interval [xj,Xj+\) contains 
its original key, which we can do in 0{p{s) + 1) time by sequential search. By 
(1-4), this means that in 0{n) postprocessing time we have partitioned the 
set of original keys into disjoint intervals, each containing between en and Gen 
keys. Next, in 0{l) time, we locate which interval contains the original key of 
rank \2en~\k, for 1 <k <1. Finally, by using linear selection-finding, we find the 
original key of that rank, all of which takes 0(n) time. Let ni = • • • = n; = |"2£n] ; 
a standard counting argument shows that any comparison tree for computing 
these percentiles has height at least 



log 




l7(nlog 1/e). 
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Abstract. In this paper we discuss a variation of the classical Huff- 
man coding problem: finding optimal prefix-free codes for unequal letter 
costs. Our problem consists of finding a minimal cost prefix-free code 
in which the encoding alphabet consists of unequal cost (length) letters, 
with lengths a and /3- The most efficient algorithm known previously 
required ) time to construct such a minimal-cost set of n 

codewords. In this paper we provide an time algorithm. 

Our improvement comes from the use of a more sophisticated modeling 
of the problem combined with the observation that the problem possesses 
a “Monge property” and that the SMAWK algorithm on monotone ma- 
trices can therefore be applied. 

1 Introduction 

The problem of finding optimal prefix- free codes for unequal letter costs (and 
the associated problem of constructing optimal lopsided trees) is an old and 
hard classical one. The problem consists of finding a minimal cost prefix- free 
code in which the encoding alphabet consists of unequal cost (length) letters, of 
lengths a and (3, a < (3. The code is represented by a lopsided tree, in the same 
way as a Huffman tree represents the solution of the Huffman coding problem. 
Despite the similarity, the case of unequal letter costs is much harder then the 
classical Huffman problem; no polynomial time algorithm is known for general 
letter costs, despite a rich literature on the problem, e.g., However there 

are known polynomial time algorithms when a and j3 are integer constants Q. 

The problem of finding the minimum cost tree in this case was first studied 
by Karp Q in 1961 who solved the problem by reduction to integer linear pro- 
gramming, yielding an algorithm exponential in both n and (3. Since that time 

* The work of the second author was partially supported by Hong Kong RGC CERG 
grant 652/95E, that of the third author by NSF grant CCR-9503441 
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there has been much work on various aspects of the problem such as; bounding 
the cost of the optimal tree, Altenkamp and Mehlhorn Q, Kapoor and Reingold 
[y] and Savari ^9; the restriction to the special case when all of the weights 
are equal, Cot Perl Gary and Even ^S, and Choi and Golin Q and ap- 
proximating the optimal solution, Gilbert^. Despite all of these efforts it is 
still, surprisingly, not even known whether the basic problem is polynomial-time 
solvable. 

The only technique other than Karp’s for solving the problem is due to 
Golin and Rote Q who describe an )-time dynamic programming al- 

gorithm that constructs the tree in a top-down fashion. This is the the most 
efficient known algorithm for the case of small /3; in this paper we apply a dif- 
ferent approach by constructing the tree in a bottom-up way and describing 
more sophisticated attacks on the problem. The first attack permits reducing 
the search space in which optimal trees are searched for. The second shows how, 
surprisingly, monotone-matrix concepts, e.g., the Monge property and the 
SMAWK algorithm Q can be utilized. 

Combining these two attacks improves the running time of of | by a factor 
of 0{n^) down to 0{n^). 

Our approach requires a better understanding of the combinatorics of lop- 
sided trees; to achieve this we also introduce the new crucial concept of charac- 
teristic sequences. 

Let 0 < a < p. A tree T is a binary lopsided a,/3 tree (or just a lopsided 
tree) if every non-leaf node u of the tree has two sons, the length of the edge 
connecting u to its left son is a, and the length of the edge connecting u to its 
right son is /3. Figure Jshows a 2-5 lopsided tree. Let T be a lopsided tree and 
V G T some node. Then 

depth{T, v) = sum of the lengths of the edges connecting root{T) to v 
depth(T) = max{dept/i(T, u) : S T} 

For example, the tree in Figure Jhas depth 20. Now suppose we are given a 
sequence of nonnegative weights P = {pi, p 2 , . . . , p„}. Let T be a lopsided tree 
with n leaves labeled v\, V 2 , ■ ■ ■ , Vn- The weighted external path length of the 
tree is 

cost{T,P) = Y.iPr' depth{T,Vi). 

Given P, the problem that we wish to solve is to construct a labeled tree T that 
minimizes cost(T, P) . 

As was pointed out quite early Q this problem is equivalent to finding a 
minimal cost prefix-free code in which the encoding alphabet consists of two 
(or generally, more) unequal cost (length) letters, of lengths a and /3. Also note 
that if a = /3 = 1 then the problem reduces directly to the standard Huffman- 
encoding problem. 

Notice that, given any particular tree T, the cost actually depends upon the 
labeling of the leaves of T, the cost being minimized when pi < P 2 ■ ■ ■ < Pn 
and depth{T,vi) > depth{T,V 2 ) > ■ ■ ■ > depth{T,Vn). We therefore will always 
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Fig. 1. An example 2-5 tree T. The characteristic sequence B = sequenceiT) is 
(2,4,6,7,9,9,10,10,12,13,14,15,16,16,17,17,17,17,17). 



assume that the leaves of T are labeled in nonincreasing order of their depth. 
We will also assume that pi < P 2 < ■ ■ ■ < Pn- 

Note: In this extended abstract we omit many technical proofs. 



2 Combinatorics of Lopsided Trees and Monotonic 
Sequences 

The first crucial concept in this paper is the characteristic sequence of a tree 
T. Denoted by sequenceiT) this is the vector Bt = (&o, ^i, • • • , ^d-i) in which 
bi is the number of right children on or below level i for 0 < i < d, where d is 
the height of the tree, and the levels are enumerated from bottom to top (See 
Figure^. 

Let n and P be fixed. Now let B = 5o, ^i, . . . , ^d-i be any sequence, not 
necessarily one of the form B = Bt defined by some tree T. B is said to be 
monotonic \i d> (3 and 
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Note that the number of right children on or below level i of tree T can not 
decrease with i so for all trees T, Bt is a monotonic sequence. 

A monotonic sequence B of length d terminates in a /3- tuple ( 7 ^ , i , . . . , 71 ) 
if Vj, 0 < j < /3, bd-j = 7 j • Note that if T is a lopsided tree with n leaves then 
T must have n — 1 internal nodes and thus n — I right children. Furthermore the 
top /3 levels of T can not contain any right children. Thus if B = sequence(T) 
for some T then B terminates in a (3 tuple (n— l,n — l,...,n — 1). 




Numbers Nj 



numbers b j 



Fig. 2. The bottom forest Fn of the tree T from Figure Q 



For a monotonic sequence B = bo, bi, . . . , bd-i define 
Nk{B) = bk + bk-(p-a) - bk- 0 , Si = '^Pj, cost{B, P) = 

j<i Q<k<d 

( 1 ) 

If f < 0 or z > n then Si = 00 . For a tree T, denote by Tk = forestk(T) the 
forest resulting by taking all nodes at level k and below (See Figure 2). Denote 
by iVfc(T) the number of leaves in forestk(T). (Note that we have overloaded the 
notation Nk{ ) to apply to both trees and sequences.) 

The following lemma collects some basic facts: 

Lemma 1 Let T be a lopsided tree and B = sequenceiT). Then 
(PI) cost(T, P) = ^Q^k<depth{T) 

(P2) y 0 < i < d = depth{T), N^{T) = bi + bi_(^jj_a) -bi-p 
(where V j < 0, we set bj =0/ 

(P3) cost{T, P) = cost{B, P), 



Proof. 

We omit the proof of (PI) which is straightforward but tedious. To prove (P2) 
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note that Ti is a forest, hence 

Ni{T) = {u £ Fi : u is a leaf in Fi\ (2) 

= Number of internal nodes in Fi F Number of trees in Fi (3) 

The first summand in the last line is easily calculated. A node at height k is 
internal in Fi if and only if it is the father of some right son at level k — (3. Thus 

Number of internal nodes in Fi = bi- 13 . (4) 

The second summand is only slightly more complicated to calculate. The number 
of trees in Fi is exactly the same as the number of tree-roots in Fi. Now note 
that a node in Fi is a tree-root in Fi if and only if its father is not in Fi. Thus 
a right son at height k in Fi is a tree-root if and only if i — f3<k<i and there 
are exactly bi — bi-p such nodes. 

Similarly a left son at height fc is a tree- root if and only if i — a < k < i. 
This may occur if and only if the left son’s right brother is at height k, where 
i — (}<k<i — {(3 — a). The number of such nodes is therefore bi-f^p-a) ~ bi-p. 
We have therefore just seen that 

Number of trees in Fi — {bi — bi-p) + {bi-(p-a) ~ bi-p). (5) 

Combining Q and (Q completes the proof of (P2). (P3) follows from (PI) and 
(P2). 

Now define a sequence B to be legal if B is monotonic and B = sequence{T) 
for some lopsided tree T. The lemma implies that minimizing cost over all legal 
sequences is exactly the same as minimizing cost over all lopsided trees. 

However, not all sequences are legal so this knowledge does not at first seem 
to help us. In the next section we sketch a proof of the following fact. Given any 
minimum-cost monotonic sequence that terminates in the /3-tuple (n — l,n — 
1, . . . , u — 1) it is possible to build a legal sequence with the same cost. Since all 
legal sequences are monotonic this legal sequence must be a minimal-cost legal 
sequence and thus correspond to a minimum-cost tree. In other words, to find 
a minimal-cost tree it will suffice to find a minimum-cost monotonic sequence 
terminating in (n — 1, n — 1, . . . , n — 1). 



3 Relation Between Minimum Sequences and Optimal 
Trees 

We start by assuming that B = sequence{T) for some T. In T the weight p\ 
is associated with some lowest leaf at level 0. The left sibling of this leaf is 
associated with some other weight pk- How can such a fc be identified? 

Observe that this sibling can be the lowest leaf in the tree which is a left-son, 
i.e., the lowest left node in T. Such a node appears on level (3— a (see the left tree 
in Figure^. The number of leaves below this level is bp-a-i, so assuming that 
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we list items consecutively with respect to increasing levels, the lowest left-son 
leaf has index k = FirstLeft{B), where 

FirstLeft{B) = bp-a-i + 1 

We state without proof the intuitive fact that if T is an optimal tree in which 
Pi, pk label sibling leaves, then the tree T' that results by (i) removing those 
leaves and (ii) labeling their parent (now a leaf) with p\ + pk will also be an 
optimal tree for the leaf set P' = P Li {pi + pk} — {pi,Pk} (see the right tree in 
Figure^, Also, calculation shows that 

cost{T, P) = cost{T' , P') + P ■ Pi + a ■ Pk- (6) 

The rest of this section will be devoted to translating this intuition into facts 
about trees and sequences. 

If pi, Pk are siblings in a tree T then denote by T' = merge{T, 1, k) the tree 
in which leaves pi, pk are removed and their parent is replaced by a leaf with 
weight Pi + Pk (see Figure^. We also write unmerge{T' , 1, k) = T. Thus 

cost{unmerge{T\ 1, fc), P) = cost(T' , P') + P ■ pi + a ■ pk- (7) 

For the sequence B = {bo, bi, . . .bj,) denote 



dec{B) = B' = {bo - 1, 6i - 1, &2 - 1, ■ ■ • - !)• 



Note that (after any leading zeros are deleted) this sequence is the characteristic 
sequence of T' = merge{T, l,k). 

Assume P is a sorted sequence of positive integers, x is a positive integer, 
and insert{F, x) is the sequence P with x inserted and sorted (as in insertion 
sort). Now denote by delete{P,pi,pk) the sequence P with elements pi and pk 
deleted, and define 

P' = package -merge{P, 1, k) = insert{delete{P,pi,pk),Pi + Pk)- 

For example if P = {2, 3, 4, 5, 10} then 

P' = delete{P,2,A) = {3,5,10} 
insert{P', 6) = {3, 5, 6, 10} 
package -merge{P, 1, 3) = {3, 5, 6, 10} 

After appropriate manipulations (deleted in this abstract) we derive the fol- 
lowing essential fact: 

Lemma 1 ((key- lemma)). 

Let k = FirstLeft{B) , P' = package-merge{P, 1, k) and B' = dec{B)- Then 



cost{B' , P') < cost{B, P) — P ■ Pi — a - Pk- 
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on these levels 



Fig. 3. The correspondence between trees T, T' and their sequences: T' = 
merge{T,l,3) and sequence(T) = B = (1, 2, 2, 3, 3, 4, 4, 4, 4, 4) sequence{T') = 
dec{B) = B' = (0, 1, 1, 2, 2, 3, 3, 3, 3, 3); FirstLeft{B) = b^-a-i + 1 = ^ 5 - 2-1 + 
1 = 3 and cost{T) = cost(T') + 2p^ + 5pi. 



This lemma permits us to prove that minimum-cost monotonic sequences 
have the same cost as minimum cost trees and permit the construction of such 
trees: 

Theorem 1 ((correctness)). 

Assume B is a minimum cost monotonic sequence terminating in{n—l,n — 
1, . . . , n — 1) for the sequence P. Then there is a tree T such that: 

(1) cost(T, P) = cost(B,P). 

Furthermore if n > 2 then 

(2) FirstLeft{B) is the index of the left brother ofpi in T, 

(3) B' = dec{B) is a minimum cost sequence for P' = package-merge{P, 1, k). 

Proof. 

The proof is by induction with respect to the number n of items in P. If n = 2 
then all legal sequences have the form 

bo = bi = ■ ■ ■ = bd-i = 1 

where d> j3. The sequence with d = (3 has minimum cost and this sequence is 
also the minimum-cost monotonic sequence. 

So now suppose that n > 2. Let B' = dec{B) and T' be a minimum cost tree 
for P' . P' has n — 1 items, so by the induction hypothesis cost{T' , P') equals the 
minimum cost of a monotonic sequence for P' . In particular, by Lemma ^ we 
have 

costfT' , P') < cost{B' , P') < cost{B, P) — a ■ pk — P ■ Pi- ( 8 ) 

Take T = unmerge{T' , 1, fc). Then by Equality Q and Inequality (Q we have: 

cost{sequence(T) , P) = cost{T, P) = cost{T' , P') + a ■ pk + P ■ Pi < cost{B, P). 
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B was chosen to be a minimal cost sequence so all of the inequalities must be 
equalities and, in particular, we find that cost(T,P) = cost{B,P). Hence T is 
the required tree, and this completes the proof of (1). 

We also find that 

cost{T' , P') + a ■ pk + 13 ■ Pi = cost{B, P) 

so plugging back into ^ we find that cost{T',P') = cost{B',P'). Since T' is 
a minimal cost tree for P' the induction hypothesis implies B' is a minimum 
cost sequence for P' proving (3). The proof of (2) follows from the details of the 
construction. 

Note that this theorem immediately implies that, given a minimum-cost se- 
quence B for P, we can construct a minimum-cost tree for P. If n = 2 the 
tree is simply one root with two children. If n > 2 calculate B' = dec{B) and 
P' = package -merge{P, 1, k) in 0{n) time. Recursively build the optimal tree T' 
for P' and then replace its leaf labelled by pi -I- pk by an internal node whose 
left child is labelled by pk and whose right child is labelled by pi . This will be 
the optimal tree. Unwinding the recursion we find that the algorithm uses 0{n?) 
time (but this can easily be improved down to O(nlogn) with a careful use of 
data structures) . 

4 The Monge Property and the Algorithm 

We now introduce the weighted directed graph G whose vertices are monotonic (3- 
tuples of nonnegative integers in the range [0 ... n — 1] . There is an edge between 
two vertices if and only they “overlap” in a (/3— l)-tuple, precisely defined below. 

Suppose if) < i\ < i 2 ^ ■ i/ 3 -i < i/ 3 - Let u = (*o, * 2 , • • ■ , i/ 3 -i) and v = 

{ii, i 2 , . . . , i/ 3 -i, i/ 3 )- Then there is an edge from u to v if u ^ v, and furthermore, 
the weight of that edge is 

Weight{u,v) = EdgeCost{io,ii, ■■ ■,*/?) = 

Observe that if (u, v) is an edge in G then the monotonicity of (* 0 ) * 2 , • ■ • > 

i/ 3 -i, ip) guarantees that u is lexicographically smaller as a tuple than v. In other 
words the lexicographic ordering on the nodes is a topological ordering of the 
nodes of G; the existence of such a topological ordering implies that G is acyclic. 
Note that the /3-tuple of zeros, (0, ... 0), is a source. We refer to this node as the 
initial node of the graph. Note also that the /3-tuple (n — 1, . . . , n — 1) is a sink; 
we refer to it as the final node of the graph. 

For any vertex u in the graph, define cost{u) to be the weight of a shortest 
(that is, least weight) path from the initial node to u. 

Suppose we follow a path from the source to the sink and, after traversing 
an edge (u,v), output ip, the final element of v. The sequence thus outputted is 
obviously a monotonic sequence terminating in the /3-tuple (n— l,n— 1, . . .,n— 1) 
and from the definition of Weight{u, v) the cost of the path is exactly the cost 
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of the sequence. Similarly any monotonic sequence terminating in the /3-tuple 
(n — 1, n — 1, . . . , n — 1) corresponds to a unique path from source to sink in G. 

In particular, given a tree T and B = sequence(T) Lemmajimplies that the 
cost of the path corresponding to B is exactly the same as cost{T) . 

Example. 

The tree T from Figure 3 has B = sequence{T) = (1, 2, 2, 3, 3,4, 4, 4, 4) and its 
corresponding path in the graph G 

(0, 0, 0, 0, 0) ^ (0, 0, 0, 0, 1) ^ (0, 0, 0, 1, 2) 

^ (0, 0, 1, 2, 2) ^ (0, 1, 2, 2, 3) ^ (1, 2, 2, 3, 3) ^ (2, 2, 3, 3, 4) 

^ (2, 3, 3, 4, 4) ^ (3, 3, 4, 4, 4) ^ (3, 4, 4, 4, 4) ^ (4, 4, 4, 4, 4) 

The cost of this path and also of the tree T is 

S'! -I- 2 • 5*2 + S 4 -l- 6 ■ iSs 

The above observations can be restated as 

Observation 2 Assume T is a tree and B = sequence{T). Then cost(T) = 
cost{B) equals the cost of the path in G corresponding to B. 

The correctness theorem and the algorithm following it can can now be re- 
formulated as follows: 

Theorem 2. The cost of a shortest path from the initial node to the final node 
is the same as the cost of a minimum cost tree. Furthermore given a minimum 
cost path a minimum-cost tree can be reconstructed from it in 0{n^) time. 

Observe that G is acyclic and has edges. The standard dynamic- 

programming shortest path algorithm would therefore find a shortest path from 
the source to the sink, and hence a min-cost tree, in time. We now 

discuss how to find such a path in O(n^) time. Our algorithm obviously cannot 
construct the entire graph since it is too large. Instead we use the fact that, 
looked at in the right way, our problem possesses a Monge property. 

A 2-dimensional matrix A is defined to be a Monge matrix Uif for all i,j 
in range 

A{iG) + A{i+l,j + 1) <A{i,j+l) + A{i+l,j) (9) 

Now let 5 = (ii, *2, ■ ■ ■ , */3-i) be any monotonic (/3 — l)-tuple of integers. For 
0 < * < *1 and ifi-i < J < n — 1, define 

EdgeCostg{i,j) = EdgeCost{i,ii, ■ .., i/3-1, j) = Sj+i^-i 

As{i,j) = cost{i,ii, ■ ■ ■ + EdgeCostg{i,j). 

The important observation is that 

Theorem 3 ((Monge-property theorem)). 

For fixed S, the matrix As is a two-dimensional Monge matrix. 
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Proof. 

Let S = (ii, i 2 , . . . , */ 3 -i). We prove Equation Q, where A = As- If the right 
hand side of Equation Q is infinite, we are done. Otherwise, by the definitions 
of the Sk, and of As, canceling terms when possible, we have 

As{i,j + 1) + ^ 45(1 + 1, j) - As{i,j) - As{i + l,j + l) = pj+i^-i+i - pj+i^-i > 0 

which completes the proof. 

A 2 X 2 matrix A is defined to be monotone if either An < An or A 21 > A 22 . 
An n X m matrix A is defined to be totally monotone if every 2x2 submatrix 
of A is monotone. The SMAWK algorithm Q takes as its input a function which 
computes the entries of an n x m totally monotone matrix A and gives as output 
a non-decreasing function /, where 1 < /(*) < m for 1 < i < n, such that 
Aij(i) is the minimum value of row i of A. The time complexity of the SMAWK 
algorithm is 0{n + m), provided that each computation of an Aij takes constant 
time. Note that every Monge matrix is totally monotone so our matrices As are 
totally monotone. This fact permits us to prove: 

Theorem 4 ((Shortest-path theorem)). 

A shortest path from a source node to the sink node in G can be constructed in 
0{n^) time. 

Proof. 

The case where /3 = 1 requires an exceptional proof, because the proof below 
fails if the sequence <5 is a 0-tuple. However, that case is already proved in 
Thus, we assume j3 >2. 

In this extended abstract we actually only show how to calculate the cost 
of the shortest path. Transforming this calculation into the construction of the 
actual path uses standard dynamic programming backtracking techniques that 
we will leave to the reader. 

Our approach is actually to calculate the values of cost{u) for all monotonic [3- 
tuples u = (io,ii, . . ., ip-i). In particular, this will calculate the value of cost{n — 
l,n— l,...,n — 1) which is what is really required. 

For fixed 6 = {i\, 12 , • • • , i/ 3 - 1 ) note that 

Vj>i/3-i, cost{5, j) = Tam{As{i, j) : i < *i} 

Also note that As{i,j) can be calculated in constant time provided the values of 
cost(i, S) is known. This means that, given a fixed 6, if the values of cost(i, 6) are 
already known for all i then the values of cost{5, j) for all j can be calculated in 
total time 0(n) using the SMAWK algorithm. We call this 0{n) step processing 

6. 

Our algorithm to calculate cost{io,ii, . . . , i/ 3 - 1 ) for all /3-tuples is simply to 
process all of the (/3 — 1) tuples in lexicographic order. Processing in this order 
ensures that at the time we process S the values of cost{i, <5) are already known 
for all i. 

Using the SMAWK algorithm it thus takes 0(n) time to process each of the 
0(n^“^) (/3 — l)-tuples so the entire algorithm uses O(n^) time as stated. 
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Algorithm OptimaLTree-Construction 
sequence construction phase: 

compute a shortest path tt from source to sink in G; 
let B be the sequence corresponding to tt; 
tree reconstruction phase: 

construct optimal tree T from B using 
recursive algorithm described following the 
Correctness Theorem 
end of algorithm. 



Theorem 5 ((main result)). 

We can construct a minimum cost lopsided tree in 0{n^) time. 

Proof. 

If /3 = 1 use the basic Huffman encoding algorithm which runs in 0{n) time. 
Otherwise apply the algorithm OptimaLTree-Construction. Theorem Jtells us 
that B can be computed in 0{nl^) time. 

The algorithm described following the Correctness Theorem for constructing 
an optimal tree from B runs in 0{n^) = 0{n^) time completing the proof of the 
theorem. 

We conclude by pointing out, without proof, that the algorithm 
OptimaLTrec-Construction can be straightforwardly extended in two different 
directions: 

Theorem 6. 

We can construct a minimum cost lopsided tree in 0(n-log^ n) time with 0(n^~^) 
processors of a PRAM. 

Theorem 7 ((height limited trees)). 

We can construct a minimum cost lopsided tree with height limited by L in 
0{n^ ■ L) time. 

(A tree with height limited by L is one in which no node has depth greater 
than L.) 
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Abstract. In network commnnication systems, frequently messages are 
routed along a minimum diameter spanning tree (MDST) of the net- 
work, to minimize the maximum delay in delivering a message. When a 
transient edge failure occurs, it is important to choose a temporary re- 
placement edge which minimizes the diameter of the new spanning tree. 
Such an optimal replacement is called the best swap. As a natural exten- 
sion, the all-best-swaps (ABS) problemis the problem of Ending the best 
swap for every edge of the MDST. Given a weighted graph G = (V,E), 
where |U| = n and \E\ = m, we solve the ABS problem in 0{n^/rn) time 
and 0{m -|- n) space, thus improving previous bounds for m = o(n^). 



1 Introduction 

For communication networks, it is important to remain operational even if indi- 
vidual network components fail. In the past few years, the ability of a network 
to survive a transient failure (its survivability) has been studied intensely (an 
excellent survey paper on survivable networks is Q). From the practical side, 
this has largely been a consequence of the replacement of metal wire meshes 
by fiber optic networks: Their extremely high bandwidth makes it economically 
attractive to make networks as sparse as possible. In the extreme, a network 
might be designed as a spanning tree of some underlying graph of all possible 
links. A sparse network, however, is less likely to survive a transient edge (or 
node) failure than a mesh with a multitude of connections that can be used as 

* This research was carried out while the first two authors visited the third author 
within the CHOROCHRONOS TMR program of the European Community. 
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detours. Therefore, it is important for sparse networks to also take survivability 
into account from the very beginning. 

On the theoretical side, this gave rise to an interesting family of problems 
on graphs. In particular, it is important to know in advance how some specific, 
significant feature of the spanning tree changes as a consequence of a transient 
edge failure. For example, if the spanning tree has minimum length (i.e., is a 
minimum spanning tree (MST)), the most immediate feature of interest is the 
total length of the tree itself, and we are interested in finding the edge whose 
removal results in the greatest increase of the length of the MST. This problem 
has been studied (at least implicitly) for a decade, and the fastest solution known 
to date is an 0{m + nlogn) time algorithm where \V\ = n and \E\ = m. 
Another meaningful class of problems arises when the spanning tree is a single- 
source shortest path tree (SPT), which is another popular network architecture. 
In such a case, the problem of finding an edge in the tree whose removal results 
in the largest increase of the distance between the source node r and a given 
node s has been solved efficiently by Malik et al. Q, who gave an 0(m-|- nlogn) 
time algorithm. Recently, Nardelli et al. defined a different parameter for 
measuring the vitality of an edge along a shortest path, looking for the edge 
whose removal will result in the worst detour to reach the destination node, and 
they solved the problem in 0(m + nlogn) time. 

However, in several applications, the used spanning tree is neither an MST 
nor a SPT. Rather, many network architectures look for minimizing the diameter 
of the spanning tree, that is the length of the longest path in the tree between any 
two nodes, so that the maximum propagation delay in the network will be as low 
as possible. Hence, a minimum diameter spanning tree (MOST) is a spanning tree 
having minimum diameter among all the spanning trees. For an MDST subject 
to transient edge failures in the way just described, it makes sense to temporarily 
substitute the failing edge with a single edge which makes the diameter of the 
new spanning tree as low as possible (for practical motivations, see Q). Such an 
optimal replacement is called a best swap. As a natural extension, the all-best- 
swaps (ABS) problem is the problem of finding a best swap for every edge of the 
MDST. 

The related problem of maintaining a spanning tree of small diameter in 
a fully dynamic context, where a tree evolves due to repeated insertions and 
deletions of edges, has been solved in [))]. This problem belongs to the family of 
problems that is concerned with updating the solution of a graph problem after 
dynamic changes of the graph; for a recent survey, consult [^. The approach in 
[7] is more general than what is needed for solving the ABS problem. Hence, if 
we use it for solving the ABS problem, we spend 0{n^) time and 0{m-\-n) space 
and preprocessing time. We get these bounds by computing a best swap in 0{n) 
time for each deleted edge. Recently, Alstrup et al. [7] improved the runtime for 
computing a best swap in an incremental context (i.e., when no deletions are 
allowed) to 0(log^ n). By using some of the results in Q, their approach can be 
adapted to solve the ABS problem, and the obtained runtime is 0(ny7nlog n), 
with 0{m -\- n) space. 
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In this paper we solve the ABS problem in 0{n^/m) time and 0(m+n) space, 
thus strictly improving previous bounds for m = o{n?). This can be done by 
adapting the well-known path halving compression technique for answering 
in 0(1) amortized time (over a total of 0{n^/m) queries) the following query: 
Given a rooted tree T and a pair of nodes y and v in T such that v belongs to 
the subtree of T rooted at y, what is the length of a longest simple undirected 
path in T, starting at v and staying within the subtree of T rooted at yl 

The paper is organized as follows: in Section 2 we define the problem more 
precisely and formally. Section 3 proposes the algorithm for solving the problem. 
Finally, in Section 4, we present the adaptation of the path halving compression 
technique which allows to efficiently solve the ABS problem. 

2 Problem Statement 

Let G = (y, E) be a biconnected, undirected graph, where V is the set of vertices 
and E CV xV is the set of edges, with a nonnegative real length |e| associated 
with each edge e G E. Let n and m denote the number of vertices and the 
number of edges, respectively. A graph H = {V , E') is called a subgraph of G if 
V' C V and E' Q E. li V = V then El is called a spanning subgraph of G. A 
connected acyclic spanning subgraph of G is called a spanning tree of G. 

A simple path (or a path for short) in G = (y E) is a subgraph (y', E') with 
V = {di, . . . , Vk\vi y Vj for i y j} and E' = {{vi, r>i+i)|l < i < k}, also denoted 
as (wi, . . . , Vk)- Path {vi, . . Vk) is said to go from vi to Vk- Because there is 
exactly one path between any two nodes in a tree, let us denote in this case the 
length |(r;i, . . . , Vk)\ of the path as d{vi, Vk)- 

Let T>(T) = (di, d. 2 , ■ ■ ■ , dfe) denote a diameter path of T, that is a path whose 
length \D{T) \ is maximum in T. We call \D{T) \ the diameter of T. For simplicity, 
we will use the term diameter also for the diameter path, whenever no confusion 
arises. A minimum diameter spanning tree of G is a spanning tree of G having 
minimum diameter. If e S Et is a tree edge, a replacement edge (or swap edge) 
for e is an edge f G E \ Et such that Tgj = (y Et — e + f) is a spanning tree 
of G. Let TZe denote the set of replacement edges for e. A best swap for e is an 
edge r(e) G TZe such that 



\V{Te,r(e))\ = jnmmTej)]}. 

The all-best-swaps (ABS) problem is the problem of finding a best swap for 
every edge e € Et- 

3 Solving the ABS Problem 

To solve the ABS problem efficiently, we will exploit relationships among the 
original spanning tree T and the replacing ones. Let dc, with 1 < c < fc, be 
the center of the diameter path, that is the node in T>(T) for which \d{di, dc) — 
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d{dc, dk)\ is minimum. If this node is not unique, we take the node farther from 
di . Let T denote a source directed tree obtained by rooting T in dc and orienting 
the edges towards the leaves. Following we maintain an augmented topology 
tree and an augmented 2-dimensional topology tree to efficiently retrieve 
only 0{^Jrn) selected edges among the 0(m) replacement edges, whenever an 
edge e in T is deleted. In fact, among the selected edges, a best swap is contained. 



3.1 The algorithm 

The general outline of our algorithm is the following: 

Step 0: Perform preliminary computations. 

Step 1 : For each edge e € T as considered by a postorder visit 

Step 2: Delete e; update the topology and the 2-dimensional topology tree. 

Step 3: Compute the set of selected replacement edges Se ^'R-e- 

Step 4: For each edge / e 5e, compute \V{Tej)\ and select a best swap. 

Step 5 : Insert e and update the topology and the 2-dimensional topology tree. 

Step 0 requires 0{n+m) time and space, as we show next. Notice that in Step 
1 we consider all the 0(n) edges of the tree, in the order they are generated by a 
postorder tree visit (this order is needed for a correct path halving compression, 
as is shown in Section 4). Steps 2, 3 and 5 can be accomplished in 0{^/m) 
time, and |5e| = 0{^/m) Q. Concerning Step 4, we will show that \T>{Tej)\ 
can be computed in 0(1) amortized time. This can be done using the path 
halving compression technique that will be presented and analyzed in detail in 
Section 4. This technique, given a rooted tree T and a pair of nodes u and 
u in T such that u belongs to the subtree of T rooted at v, allows to find in 
0(log(]^^Q/„) n) amortized time per query (over a total of Q queries) the length 
Findpath{u, v) of a longest simple undirected path starting from u and staying 
within the subtree rooted at v. Since for solving the ABS problem we will prove 
that Q = 0{n^/rn), Findpath{u,v) can be computed in 0(1) amortized time. 
Thus, Step 4 costs 0{^/m) amortized time, and the global time for solving the 
ABS problem is 0{nydm), using 0(n -I- m) space. 

3.2 Preliminary computations 

Let Ct) = (c?i, . . . , dc-i, dc) and TZt> = {dc, dc+i, •■•,,, dk)- W.l.o.g., let us as- 
sume \Cx>\ > \'Rt>\- We mark in 0(n) time all the nodes in T rooted at dc-i, say 
Ni, and all the nodes rooted at dc+i, say Nr, with the nearest node on the diame- 
ter from which they descend (if a node is on the diameter, then it is marked with 
itself). Nc will denote the remaining nodes, marked with dc- Figure ^depicts 
this notation. 

Then, we associate with each node v G T its distance from dc, and the 
lengths hi{v),i = 1,2, of the two longest directed paths in T emanating from 
V and making use of two different subtrees of v, if they exist. For each of these 
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paths we also store the nodes ai{v),i = 1, 2, adjacent to v. Let ai(t;)] and 

[/ 12 (f), a 2 (t')] be these pairs of values, with hi{v) > / 12 (f)- With the root, we also 
associate a further value, say h^{dc), corresponding to the length of a longest 
path in T starting from dc and not using dc~i and dc+i- After, we associate with 
each node dj G £x> {dj G TZv) a further value, say X{dj), containing the length 
of a longest path in T starting from dc and containing neither dc+i (dc-i) nor 
the edge {dj, dj-i) {{dj, dj+i))- Note that since dc belongs to both £x> and TZd, 
X{dc) coincides with h‘i{dc)- We express \{dj) recursively as follows: 

\{dc) = h^{dc) 

X{dj) = max(A(c/j+i), d{dc, dj) + h 2 {dj)) for j = c — 1, . . . , 1 

X{dj) = max(A(c/j_i), d{dc, dj) + h 2 {dj)) for j = c + 1, . . . , k. 

Next, we associate with each node dj G Cj) {dj G TZ'd) a further value, say ^i{dj), 
containing the length of a longest path in T starting from d\ {dk) and staying 
within the subtree of T rooted at dj. We express n{dj) recursively as follows: 

H{di) = fi{dk) = 0 

n{dj) = max(/r((/j_i), d{d\, dj) + h 2 {dj)) for j = 2, . . . , c — 1 

n{dj) = max(/r(c/j+i), d{dk, dj) + h 2 {dj)) for j = fc — 1, . . . , c + 1. 

We also associate with each node on the diameter a further value, say p{dj), 
containing the nearest node along the diameter of the path stored in fji{dj) (this 
can be done during the computation of p,{dj)). It is easy to see that all the above 
computations cost 0{n) time. 

Finally, we convert G into a graph G' with maximum vertex degree 3 Q, and 
we derive from T a spanning tree T' of G' (see ^ for further details) . Then, we 
associate to T' a topology tree and an augmented 2-dimensional topology tree 
which can be initialized in 0{m) time and space. It is easy to see that 
an edge swap in T corresponds to an edge swap in T' , and vice versa. Therefore, 
in the following we will continue to refer to the original spanning tree T, even 
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though our algorithm makes use of the topology tree and the 2-dimensional 
topology tree of T' . 

Summarizing, preliminary computations have an overall cost of 0{m + n) 
time and use 0(m + n) space. 

3.3 Computing |X>(Te,/)| in 0(1) amortized time 

In the rest of the paper, two paths will be considered adjacent if they share the 
root dc only. When the edge e = {x, y) is removed, T is split into two subtrees, 
say Tx and Ty, which will be later connected by means of a replacement edge 
f = (u,v). As a. consequence 

\V{T,j)\ = max{|I?(T,)|, |I?(r,)|, \Vf\} (1) 

where \Vf\ is the length of a longest path in Tgj passing through /. We now 
analyze different cases that can arise in solving the ABS problem. For the sake 
of clarity, we perform a different analysis depending on whether the removed 
edge is located on the diameter or not. 

3.3.1 The removed edge is not on the diameter 

Assume the edge e = (x, y) is removed, where x is closer to dc than y and 
e ^ V(T). In this case, neither Ct> nor TZx> are affected. Trivially, T>{Tx) = T>{T). 
Moreover, |I?(rj,)| < |I?(T)|, since a diameter in Ty is a diameter in T too. It 
then remains to compute \Vf\, for any selected replacement edge f G Se- Let 
/ = (m, v), where u G and v € Ty. It is clear that 

l^/l = l>C„| + I/I + |T.| 

where £„ is a longest path in starting from u and Cy is a longest path in 
Ty starting from v. Since u is a descendant of y in T, by using the path halving 
compression technique \Ly\ = Findpath{v, y) can be computed in 0(1) amortized 
time, while |/| is clearly available in 0(1) time. It remains to compute \Cu\- The 
following claim is easy to prove: 

Lemma 1. At least one of the longest paths in T^ starting from u contains dc- 

Proof. Suppose, for the sake of contradiction, that none of the longest paths 
in Tx starting from u contains dc- Let us restrict our attention to any one of 
such longest paths, say Vu- We will show that such a path can be modified into 
another path at least as long as Pu and passing through dc, from which the claim 
will follow. Let w be the node in Vu nearest to dc, and let ^ be the ending node 
of Vu other than u. Three cases are possible: 

1. w G Np. let q G Cx> be the node on V{T) nearest to w (if w is on the diameter, 
then q = w). It is trivial to see that in this case, being q ^ dc since w G Ni, 
it must be 



d{q,z) < d{q,dk) 
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since otherwise d(di, q)+d{q, z) > d{di, q)+d{q, dk) = \V{T)\. Being d{q, z) = 
d{q,w) + d{w,z), it then follows that Vu can be modified into the path 
'Pu = (u, If, . , dk) containing dc and such that 

\Vu\ = d{u, w) + d{w, q) + d{q, dk) > d{u, w) + d{q, dk) > 

> d{u, w) + d{q, z) > d(u, w) + d{w, z) = \Vu\ 
that is a contradiction. 

2. w G Nr : this case is symmetric to the first one. 

3. w G Nc- in this case, it must be clearly d{w, z) < d{w, dc) + d{dc, di), since 

otherwise d{dc, di) + d{dc, w) + d{w, z) > |D(T) | . It then follows that Vu can 
be modified into the path Vj = {u, . . . , dc, ■ ■ ■ , di), containing dc and 

such that 

\Vu'\ = d{u, w) + d{w, dc) + d{dc, di) > d{u, w) + d{w, z) = \Vu\ 

that is a contradiction. □ 

^From the above analysis and from the fact that Cd is one of the longest 
paths emanating from dc in T and that IZx) is one of the longest paths emanating 
from dc in T which does not make use of dc-i it follows that 

I „ I J d(dc, u) + [R-d I if u G iV; 

' \d{dc,u) + \Ct>\ otherwise 

and therefore, it follows that \Cu\ is available in 0(1) time. Summarizing, \Vf\ 
can be computed in 0(1) amortized time, and 

\V{Tcj)\=^s.^{\V{T)\,\Vf\). 

Once this value is computed for the 0{^/m) selected edges identified by the 
augmented topology and 2-dimensional topology tree, a best replacement edge 
is available. Therefore, the case e ^ T>{T) can be managed in 0{^/m) amortized 
time. 

3.3.2 The removed edge is on the diameter 

We will analyze the case in which e = (di,di-±) G £x> is removed, since the 
case e G TZt> is symmetric. When e is removed, T is split into two subtrees, say 
Td- and which will be later connected by means of a replacement edge 

/ = (it, v) G Sc- Equation Q becomes 

\V{Tcj)\ = max(|D(T,J|, |D(7d._J|, \Vf\). 

Let us now analyze the value of these three terms. 

• \V{Td-)\: We start by proving the following fact: 
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Lemma 2. At least one of the diameters ofT^- contains dk G 'R-v- 

Proof. In fact, for the sake of contradiction, suppose that none of the diameters 
of T(i- contains dk- Let us restrict our attention to any one of such diameters, 
say P. We will show that such diameter can be modified into a path containing 
dk and at least as long as P, from which the claim will follow. Let w be the 
node in P nearest to dc- Let Pi and P 2 be the two subpaths in which P splits 
with respect to w, with ending nodes Zi and Z 2 , respectively, and suppose that 
z\ precedes Z 2 in a preorder traversal of T . Three cases are possible: 

1. w G Ni'. this case is similar to the case 1 of the proof of Lemma ^ In fnct, 
the modified path there built also contains dk, apart from dc- 

2. w G Nr', in this case, let q G Pv be the node on P{T) nearest to w (if w is 
on the diameter, then q = w). Clearly, any path emanating from g in T is 
no longer than d{q, dk). In particular 

d{q, dk) > d{q, zi) > d{w, Zi) 

and 

d{q,dk) > d{q,Z 2 ) > d(w, Z 2 ). 

Since either PiL) {w, . . . , q) or P 2 G) {w, . . - ,q) (or both of them) is adjacent to 
{q, ... , dk), it follows that P can be modified into a no shorter path containing 
dk, that is a contradiction. 

3. w G Nc'. in this case, since Z 2 descends from dc in T, it must be clearly 
d{w, Z 2 ) < d{w, dc) + d{dc, dk), since otherwise d{dc, di) + d{dc, w) + d{w, z) > 
\P{T)\. It then follows that P can be modified into the path P' = {zi, . . . , dc, 
. . ., dk) containing dk and such that 

\P'\ = d{zi,w) + d{w, dc) + d{dc, dk) > d{zi,w) + d{w, Z 2 ) = \P\ 

that is a contradiction. □ 

^From the above result, it follows that a longest path starting from dk and 
not using the edge e just removed can be computed 0(1) time as 

\T>{Td,)\ = max(/i(4+i), A(di) + \Pt>\)- 

• |T>(Td._ J|: analogously to the previous case, it can be proved that at least one 
of the diameters of Td._^ must contain the node d±. Therefore, it will be 

\P{Td,_J\ = Kd,.i) 
which can be computed in 0(1) time. 

• \Pf\: let be / = {u, v), where u G Td- and v G Td^_,^, and \Pf \ = \£u\ + \f\ + \Pv\- 
\Cy\ = Findpath{v, di-i) can be computed in 0(1) amortized time and |/| is 
available in 0(1) time. It remains to analyze |T«|. Remember that to the node 
u is associated the nearest node q on the diameter from which it descends. The 
following three situations are possible for u: 
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1. u G Nf. in this case, still using the same arguments as for the point 1 of the 
proof of Lemma B it can be easily proved that at least one of the longest 
paths in Td- starting from u must contain dc, and then |£„| can be obtained 
in 0(1) time as 



\Cu\ = d{dc,u) + \TZt,\. 

2. u G Nr'. In this case, still using the same arguments as for the point 2 of the 
proof of Lemma | it can be proved that at least one of the longest paths 
in Trf. starting from u must contain q. Therefore, it follows that |£„| can be 
obtained in 0(1) time as (note that the following is equivalent to compute 
Findpath(u, dc)) 



\£u\ = max \^d{u, q) + d(q, p(dc+i)) + K^-c+i) ~ d{dk, p(dc+i)) , 
d{u, dk), d{u, dc) + X{di) 



3. u G Nc- in this case, it is easy to see that at least one of the longest paths 
in Td. starting from u must contain dc- In fact, for the sake of contradiction, 
suppose that none of the longest paths in T^. starting from u contains dc- 
Let us restrict our attention to any one of such longest paths, say Let w 
be the node in Vu nearest to dc, and let 2 ; be the ending node of Vu other 
than u. Clearly, d{w, z) < d{dc, dk), and therefore Vu can be modified into 
the path Vu = {u, ... ,w, ... ,dc, ■■■ , dk), containing dc and such that 

\Vu\ = d{u, w) + d{w, dc) + d{dc, dk) > d{u, w) + d{w, z) = \Vu\ 

that is a contradiction. Given that Cu contains dc, it remains to compute 
the length of a longest path starting from dc and not containing u, and this 
can be done by looking at \TLx>\ and at \{di) (note that if \{di) is exactly 
the length of a path passing through u, then it follows that \R,x>\ > X{di)). 
Thus, \Cu\ can be computed in 0(1) time as 

\Cu\ = max(d(dc,M) + \V.v\,d{dc,u) + \{di)). 

Summarizing, the case e G Cd can be managed in 0(1) amortized time for any 
of the 0{,/m) selected edges. Since the case e G TZ-d is symmetric to the previous 
one, it can be managed with the same amortized runtime. Repeating the above 
for all the n — 1 edges of T we therefore have the following: 

Theorem 1. The ABS problem for a minimum diameter spanning tree T of 
a graph G with n vertices and m edges can he solved in 0{n,/rn) time, using 
0{m + n) space. □ 



64 



E. Nardelli, G. Proietti, P. Widmayer 



4 Constructing and Using Compressed Paths 



In this section we use an adaptation of the well-known path halving compres- 
sion technique to prove that a Findpath(v, y) operation can be satisfied in 0(1) 
amortized time, as required to solve efficiently the ABS problem. 

We start creating a virtual forest T of trees. Initially, T is composed of n 
singletons, each one associated to a node v G V. With each node u in IF the 
following values are associated: [hi{v) , ai{v)] , i = 1,2, as defined in Section 
and up{v), which will contain an estimation of the length of a longest path 
emanating from v and ascending towards dc in T, now considering the edges of 
the path from dc to v as directed from v towards dc- At the beginning, up{v) = 
0, Wv G J-. The following instructions manipulate T\ 

— Link(u,v)\ combine the trees with roots u and v into a single tree rooted in 
u, adding the edge e = {u, v) of length |e|; 

— Eval{v): Return the length of a longest path starting from v in the tree 
containing it and apply a suited path halving compression technique. 

Note that Eval{y) assumes that a pointer to element v is obtained in constant 
time. The sequence of Linkf) operations in T is determined by the sequence of 
edge removals from T. Remember that we sequentially consider all the edges 
e G Et in postorder fashion. When the edge e = {x, y) is (temporarily) removed, 
we perform a sequence of Link{y, zi) in T, where Zi,i = 1, . . . , fc are all the sons 
of y in T. This means that for any given node v G V, whenever a Findpath(v, y) 
in T occurs, the node y is exactly the root of v in T . In fact, remember that we 
only ask Findpath{v, y) on nodes descending from the currently removed edge. 
This observation is crucial for the correctness of the method. We implement 
a Findpath(v, y) operation in T by means of an Eval(v) operation in IF, that 
examines all the nodes along the path from v to the root y of the tree containing it 
and compresses such a path. The compression technique used is an adaptation of 
the path halving technique , which will guarantee the associativity of Eval(v) . 
Let us describe how the path halving works. Given a couple of nodes u,v G V, 
with u ancestor of u (u ^ v) in T, we define the following function 

u(,. ,,'1 = / ^2(u) if ai{u) < V 

’ ( /ii(u) otherwise. 

It is easy to see that h(u,v) can be computed in 0(1) time, for any u,v gV, 
after 0(n) time of preprocessing of T. Assume an Eval(y) operation occurs and 
this is the first time an Eval{) operation takes place. For the sake of simplicity, 
let us focus on a path of three nodes (y, u, v), with y ^ u ^ v in T and such that 
|(u,u)| = a and |(y, u)| = /?• We have 

Eval{v) = max {hi{y), up{v), a + h{u, v), a + up{u), a+ (3+h{y^ u), a+ j3+up{y'^ . 
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The compression takes place as part of this operation, and replaces the edge 
(it, v) with an edge {y, v) of length a + /3 and the label upiv) of v with 

up{v) = max ^up{v), a + /i(it, it), a + up{u)J . 

After the compression we hence have 

Eval{v) = max ^hi(n), max ^np(n), q+/i(m, v),a-\-up{u^ , a+/3+h(y, n), Q+/3+iip(y)^ 



exactly as before the compression. Therefore, we can conclude that Evaliv) com- 
presses paths while correctly maintaining longest path information. 

In the general case, let {vq = f,ui, . . .,Ufc = y) be the path from v to the 
root y of the tree containing v in JF, having edges Ci = (i^j, fi+i), i = 0, . . . , /c— 1. 
We have 

i-l i-1 

Eval{v) = vaayi^ (jii{v),up{v),h{vi,v) + E \ej\,up{vi) + Ei^^i)- (2) 

j=0 j=0 

W.l.o.g., let k be even. The path halving technique makes every other node 
along the path (except the last and the next to last) point to its grandfather, 
i.e., replaces the edge {vi,Vi+i),i = 0, 2,...,fc — 2 with the edge (vi,Vi+ 2 ) of 
length |ei| -I- |ei+i| and sets 

up{vi) = max (up{vi), \ci\ + h{vi+i,Vi), \ci\ + up{vi+i)^,i = 0, 2, . . ., fc - 2. (3) 

Hence, after the halving, the path between v and y is (uq = f, W 2 , ■ ■ • , Vk- 2 , Vk = 
y), with edges e( = (ui, Wi+ 2 ), i = 0, 2, . . . , A; — 2 of length \ei\ + |ej+i |. Therefore, 
after the halving, we have 

( 2i-l 2i-l 

hi{v),up{v),h{v 2 i,v) + \ej\,up{v 2 i) +/ \ej\ 
j=0 j=o 

which, from Q, is equivalent to (B Therefore, it turns out that Evaliv) before 
and after the halving is invariant. Remembering the order the edges are removed, 
it is clear that the compression of the paths works correctly (i.e., the compression 
incrementally proceeds towards the upper levels of T, according to the edge 
removals). Therefore, we conclude that Eindpath(v,y) = Eval(v). 

Since naive linking in T must be applied to preserve paths of T, a sequence 
of <5 > n Eval{v) queries in T can be satisfied in 0(<51og(-]^^Q/„) n) time 
Therefore, to establish that a single query can be satisfied in 0(1) amortized 
time, it remains to prove that Q = I?(n^+'^) for some constant e > 0. We now 
informally show that Q = i.e., e > 1/2. 
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Let us distinguish the edges of T in basic edges (i.e., edges contained inside 
a basic cluster Q and therefore associated to a node at level £ = 0 in the 
corresponding topology tree) and spanning edges (i.e., edges joining two clusters 
at the same level £ > 0 of the topology tree). If an edge removed from T is a 
basic edge, then we have |5e| = 0(-^m). On the other hand, if an edge removed 
from T is a spanning edge and joins two clusters corresponding to nodes at level 
£ in the topology tree, 0 < £ < [log-y/ml — 1, then we have |5e| = 

For each edge in Se a Findpath{) query is issued. Hence the minimum overall 
number of queries is obtained when as many as possible of the n — 1 failing edges 
of T are spanning edges at the highest possible level in the topology tree. Since 
there will be spanning edges for nodes at level £ of the topology 

tree, i.e., a total of 0{y/m) spanning edges, it follows that the minimum number 
of queries is posed when 0{^/rn) edges of T are spanning and the remaining 
0{n — \/m) are basic. Therefore, we have 



Q ^ f2 







E 



i=0 



^/m ^/rr^ 

2£+T^ 



f2(n^/m) 



and since Q — 0{n^/m), it follows Q = which implies e > 1/2. 
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